Ansible

Continue evolving open source in the cloud with Red Hat and Microsoft

2025-11-20 · Microsoft Ignite 2025

theater

by Aaron Isom (Red Hat) , Mark Heslin (Red Hat)

AI/ML Azure Cloud Computing Linux Microsoft

Join Red Hat’s Aaron Isom and Mark Heslin as they detail how Red Hat and Microsoft transformed into one of the most highly successful partnerships with over 10 years of collaboration in cloud computing and services. The session will highlight product offerings, benefits, future investments, and advancements, including OpenShift Virtualization, AI, Ansible and Red Hat Enterprise Linux (RHEL) specifically optimized for performance on Azure.

Fabric-X Deployment Strategies

2025-10-21 · Fabric-X – Programming Model and Application Development Deep Dive

talk

by Marcus Brandenburger (IBM Research) , Angelo de Caro (IBM Research) , Arne Rutjes (IBM) , Elli Androulaki (IBM) , David Viejo (ChainLaunch) , Pasquale Convertini (IBM)

Kubernetes fabric-x hyperledger fabric

Practical deployment strategies using Ansible and Kubernetes to bring Fabric-X solutions into production.

Fabric-X Deployment Strategies: Ansible and Kubernetes

2025-10-21 · Fabric-X – Programming Model and Application Development Deep Dive

talk

by Marcus Brandenburger (IBM Research) , Angelo de Caro (IBM Research) , Arne Rutjes (IBM) , Elli Androulaki (IBM) , David Viejo (ChainLaunch) , Pasquale Convertini (IBM)

Kubernetes fabric-x

To close the session, we’ll walk through practical deployment strategies using Ansible and Kubernetes, equipping you with the tools and confidence to bring your Fabric-X solutions into production.

Fabric-X Endorsement Phase Deep Dive

2025-10-21 · Fabric-X – Programming Model and Application Development Deep Dive

talk

by Marcus Brandenburger (IBM Research) , Angelo de Caro (IBM Research) , Arne Rutjes (IBM) , Elli Androulaki (IBM) , David Viejo (ChainLaunch) , Pasquale Convertini (IBM)

Kubernetes fabric-x hyperledger fabric tokenization

A session exploring the Fabric-X endorsement phase, how it differs from the traditional Hyperledger Fabric model, and implications for developers. We'll cover tokenization use cases, hands-on examples, and practical deployment strategies using Ansible and Kubernetes.

Fabric-X Endorsement Phase Deep Dive

2025-10-21 · Fabric-X – Programming Model and Application Development Deep Dive

talk

by Marcus Brandenburger (IBM Research) , Angelo de Caro (IBM Research) , Arne Rutjes (IBM) , Elli Androulaki (IBM) , David Viejo (ChainLaunch) , Pasquale Convertini (IBM)

Blockchain Kubernetes fabric-x hyperledger fabric tokenization

A session focusing on the endorsement phase of Fabric-X, comparing it to traditional Hyperledger Fabric, with hands-on examples showing how the new model streamlines development for tokenization use cases and on-chain asset transfer. The session will also cover practical deployment strategies using Ansible and Kubernetes.

Data Engineering for Cybersecurity

2025-08-26 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by James Bonifield

Cloud Computing Data Engineering ELK Git Kafka Linux Logstash PowerShell Redis Cyber Security Data Streaming data +1 more

Security teams rely on telemetry—the continuous stream of logs, events, metrics, and signals that reveal what’s happening across systems, endpoints, and cloud services. But that data doesn’t organize itself. It has to be collected, normalized, enriched, and secured before it becomes useful. That’s where data engineering comes in. In this hands-on guide, cybersecurity engineer James Bonifield teaches you how to design and build scalable, secure data pipelines using free, open source tools such as Filebeat, Logstash, Redis, Kafka, and Elasticsearch and more. You’ll learn how to collect telemetry from Windows including Sysmon and PowerShell events, Linux files and syslog, and streaming data from network and security appliances. You’ll then transform it into structured formats, secure it in transit, and automate your deployments using Ansible. You’ll also learn how to: Encrypt and secure data in transit using TLS and SSH Centrally manage code and configuration files using Git Transform messy logs into structured events Enrich data with threat intelligence using Redis and Memcached Stream and centralize data at scale with Kafka Automate with Ansible for repeatable deployments Whether you’re building a pipeline on a tight budget or deploying an enterprise-scale system, this book shows you how to centralize your security data, support real-time detection, and lay the groundwork for incident response and long-term forensics.

Migrating and optimizing workloads on Google Cloud

2025-04-10 · Google Cloud Next '25

session

by Hicham Mourad (Red Hat) , Steve Richardson (Google Cloud) , Matthew Packer (Red Hat)

Cloud Computing GCP

Automation can help you migrate, coordinate and manage workloads in cloud environments in a simple and effective way, reducing complexity and operational costs. In this session we will cover: - How to increase speed and reduce errors as you migrate to Google Cloud - How Red Hat Ansible Automation Platform simplifies managing your Google Cloud infrastructure - Getting started automating Google Cloud cloud services with certified automation content for migration and management

This Session is hosted by a Google Cloud Next Sponsor.
Visit your registration profile at g.co/cloudnext to opt out of sharing your contact information with the sponsor hosting this session.

Technical Deep Dive: Ansible & Terraform - PB&J of EnterpriseIT Automation (Current & Future)

2024-10-16 · Ansible Automates NYC

talk

Terraform

Continuous Network Compliance (compliance-as-code): How organizations should leverage event-driven Ansible (EDA) to respond to changing configurations and compliance in real time

2024-10-16 · Ansible Automates NYC

talk

Real world use cases: Ansible Automation Platform customer panel discussion

2024-10-16 · Ansible Automates NYC

panel

Accelerate automation with generative AI and Ansible's enhanced developer experience

2024-10-16 · Ansible Automates NYC

talk

AI/ML GenAI

Real world story of moving to Managed Ansible in the Cloud from self-managed AAP

2024-10-16 · Ansible Automates NYC

talk

Cloud Computing

IBM PowerVC Version 2.0 Introduction and Configuration

2021-05-26 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Thierry Huché , Sachin P. Deshmukh , Scott Vetter , Stephen Lutz , Christopher Emefiene Osiegbu , Ahmed Mashhour , Borislav Ivanov Stoymirski

Cloud Computing IBM Linux Cyber Security Terraform Virtual Machine data data-engineering

IBM® Power Virtualization Center (IBM® PowerVC™) is an advanced enterprise virtualization management offering for IBM Power Systems. This IBM Redbooks® publication introduces IBM PowerVC and helps you understand its functions, planning, installation, and setup. It also shows how IBM PowerVC can integrate with systems management tools such as Ansible or Terraform and that it also integrates well into a OpenShift container environment. IBM PowerVC Version 2.0.0 supports both large and small deployments, either by managing IBM PowerVM® that is controlled by the Hardware Management Console (HMC), or by IBM PowerVM NovaLink. With this capability, IBM PowerVC can manage IBM AIX®, IBM i, and Linux workloads that run on IBM POWER® hardware. IBM PowerVC is available as a Standard Edition, or as a Private Cloud Edition. IBM PowerVC includes the following features and benefits: Virtual image capture, import, export, deployment, and management Policy-based virtual machine (VM) placement to improve server usage Snapshots and cloning of VMs or volumes for backup or testing purposes Support of advanced storage capabilities such as IBM SVC vdisk mirroring of IBM Global Mirror Management of real-time optimization and VM resilience to increase productivity VM Mobility with placement policies to reduce the burden on IT staff in a simple-to-install and easy-to-use graphical user interface (GUI) Automated Simplified Remote Restart for improved availability of VMs ifor when a host is down Role-based security policies to ensure a secure environment for common tasks The ability to enable an administrator to enable Dynamic Resource Optimization on a schedule IBM PowerVC Private Cloud Edition includes all of the IBM PowerVC Standard Edition features and enhancements: A self-service portal that allows the provisioning of new VMs without direct system administrator intervention. There is an option for policy approvals for the requests that are received from the self-service portal. Pre-built deploy templates that are set up by the cloud administrator that simplify the deployment of VMs by the cloud user. Cloud management policies that simplify management of cloud deployments. Metering data that can be used for chargeback. This publication is for experienced users of IBM PowerVM and other virtualization solutions who want to understand and implement the next generation of enterprise virtualization management for Power Systems. Unless stated otherwise, the content of this publication refers to IBM PowerVC Version 2.0.0.

Deploying SAP Software in Red Hat OpenShift on IBM Power Systems

2021-04-22 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Dino Quintero , Christoph Gremminger , Thorsten Hesemeyer , Sabine Jaeschke , Sahitya K Jain , Andreas Schauberer , Jochen Röhrig , Anastasiia Biliak

ERP IBM Kubernetes SAP data data-engineering ibm-power-systems

This IBM® Redpaper publication documents how to containerize and deploy SAP software into Red Hat OpenShift 4 Kubernetes clusters on IBM Power Systems by using predefined Red Hat Ansible scripts, different configurations, and theoretical knowledge, and it documents the findings through sample scenarios. This paper documents the following topics: Running SAP S/4HANA, SAP HANA, and SAP NetWeaver on-premises software in containers that are deployed in Red Hat OpenShift 4 on IBM Power Systems hardware. Existing SAP systems running on IBM Power Systems can be repackaged at customer sites into containers that use predefined Red Hat Ansible scripts. These containers can be deployed multiple times into Red Hat OpenShift 4 Kubernetes clusters on IBM Power Systems. The target audiences for this paper are Chief Information Officers (CIOs) that are interested in containerized solutions of SAP Enterprise Resource Planning (ERP) systems, developers that need containerized environments, and system administrators that provide and manage the infrastructure with underpinning automation. This paper complements the documentation that is available at IBM Knowledge Center, and it aligns with the educational materials that are provided by IBM Garage™ for Systems Education.

Data Infrastructure Automation For Private SaaS At Snowplow

2020-02-18 · Data Engineering Podcast Listen

podcast_episode

by Josh Beemster (Snowplow) , Tobias Macey

AI/ML Analytics AWS CloudFormation CloudWatch Amazon EMR Kinesis Big Data Cloud Computing Data Engineering Data Management ELK +12 more

Summary One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage multiple instances of their platform across customer environments. In this episode Josh Beemster, the technical operations lead at Snowplow, explains how they manage automation, deployment, monitoring, scaling, and maintenance of their streaming analytics pipeline for event data. He also shares the challenges they face in supporting multiple cloud environments and the need to integrate with existing customer systems. If you are daunted by the needs of your data infrastructure then it’s worth listening to how Josh and his team are approaching the problem.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Josh Beemster about how Snowplow manages deployment and maintenance of their managed service in their customer’s cloud accounts.

Interview

Introduction How did you get involved in the area of data management? Can you start by giving an overview of the components in your system architecture and the nature of your managed service? What are some of the challenges that are inherent to private SaaS nature of your managed service? What elements of your system require the most attention and maintenance to keep them running properly? Which components in the pipeline are most subject to variability in traffic or resource pressure and what do you do to ensure proper capacity? How do you manage deployment of the full Snowplow pipeline for your customers?

How has your strategy for deployment evolved since you first began Soffering the managed service? How has the architecture of the pipeline evolved to simplify operations?

How much customization do you allow for in the event that the customer has their own system that they want to use in place of one of your supported components?

What are some of the common difficulties that you encounter when working with customers who need customized components, topologies, or event flows?

How does that reflect in the tooling that you use to manage their deployments?

What types of metrics do you track and what do you use for monitoring and alerting to ensure that your customers pipelines are running smoothly? What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with and on Snowplow? What are some lessons that you can generalize for management of data infrastructure more broadly? If you could start over with all of Snowplow and the infrastructure automation for it today, what would you do differently? What do you have planned for the future of the Snowplow product and infrastructure management?

Contact Info

LinkedIn jbeemster on GitHub @jbeemster1 on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Snowplow Analytics

Podcast Episode

Terraform Consul Nomad Meltdown Vulnerability Spectre Vulnerability AWS Kinesis Elasticsearch SnowflakeDB Indicative S3 Segment AWS Cloudwatch Stackdriver Apache Kafka Apache Pulsar Google Cloud PubSub AWS SQS AWS SNS AWS Redshift Ansible AWS Cloudformation Kubernetes AWS EMR

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Ceph: Designing and Implementing Scalable Storage Systems

2019-01-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Vikhyat Umrao , Nick Fisk , Michael Hackett , Karan Singh

Cloud Computing Linux VirtualBox Virtual Machine ceph data data-engineering

Get to grips with the unified, highly scalable distributed storage system and learn how to design and implement it. Key Features Explore Ceph's architecture in detail Implement a Ceph cluster successfully and gain deep insights into its best practices Leverage the advanced features of Ceph, including erasure coding, tiering, and BlueStore Book Description This Learning Path takes you through the basics of Ceph all the way to gaining in-depth understanding of its advanced features. You'll gather skills to plan, deploy, and manage your Ceph cluster. After an introduction to the Ceph architecture and its core projects, you'll be able to set up a Ceph cluster and learn how to monitor its health, improve its performance, and troubleshoot any issues. By following the step-by-step approach of this Learning Path, you'll learn how Ceph integrates with OpenStack, Glance, Manila, Swift, and Cinder. With knowledge of federated architecture and CephFS, you'll use Calamari and VSM to monitor the Ceph environment. In the upcoming chapters, you'll study the key areas of Ceph, including BlueStore, erasure coding, and cache tiering. More specifically, you'll discover what they can do for your storage system. In the concluding chapters, you will develop applications that use Librados and distributed computations with shared object classes, and see how Ceph and its supporting infrastructure can be optimized. By the end of this Learning Path, you'll have the practical knowledge of operating Ceph in a production environment. This Learning Path includes content from the following Packt products: Ceph Cookbook by Michael Hackett, Vikhyat Umrao and Karan Singh Mastering Ceph by Nick Fisk Learning Ceph, Second Edition by Anthony D'Atri, Vaibhav Bhembre and Karan Singh What you will learn Understand the benefits of using Ceph as a storage solution Combine Ceph with OpenStack, Cinder, Glance, and Nova components Set up a test cluster with Ansible and virtual machine with VirtualBox Develop solutions with Librados and shared object classes Configure BlueStore and see its interaction with other configurations Tune, monitor, and recover storage systems effectively Build an erasure-coded pool by selecting intelligent parameters Who this book is for If you are a developer, system administrator, storage professional, or cloud engineer who wants to understand how to deploy a Ceph cluster, this Learning Path is ideal for you. It will help you discover ways in which Ceph features can solve your data storage problems. Basic knowledge of storage systems and GNU/Linux will be beneficial.

Managing Database Access Control For Teams With strongDM

2019-01-29 · Data Engineering Podcast Listen

podcast_episode

by Justin McCarthy (StrongDM) , Tobias Macey

Chef Data Engineering Data Management React

Summary Controlling access to a database is a solved problem… right? It can be straightforward for small teams and a small number of storage engines, but once either or both of those start to scale then things quickly become complex and difficult to manage. After years of running across the same issues in numerous companies and even more projects Justin McCarthy built strongDM to solve database access management for everyone. In this episode he explains how the strongDM proxy works to grant and audit access to storage systems and the benefits that it provides to engineers and team leads.

Introduction

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Justin McCarthy about StrongDM, a hosted service that simplifies access controls for your data

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining the problem that StrongDM is solving and how the company got started?

What are some of the most common challenges around managing access and authentication for data storage systems? What are some of the most interesting workarounds that you have seen? Which areas of authentication, authorization, and auditing are most commonly overlooked or misunderstood?

Can you describe the architecture of your system?

What strategies have you used to enable interfacing with such a wide variety of storage systems?

What additional capabilities do you provide beyond what is natively available in the underlying systems? What are some of the most difficult aspects of managing varying levels of permission for different roles across the diversity of platforms that you support, given that they each have different capabilities natively? For a customer who is onboarding, what is involved in setting up your platform to integrate with their systems? What are some of the assumptions that you made about your problem domain and market when you first started which have been disproven? How do organizations in different industries react to your product and how do their policies around granting access to data differ? What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of building and growing StrongDM?

Contact Info

LinkedIn @justinm on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

StrongDM Authentication Vs. Authorization Hashicorp Vault Configuration Management Chef Puppet SaltStack Ansible Okta SSO (Single Sign On SOC 2 Two Factor Authentication SSH (Secure SHell) RDP

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Putting Airflow Into Production With James Meickle - Episode 43

2018-08-13 · Data Engineering Podcast Listen

podcast_episode

by James Meickle , Tobias Macey

Airflow API Astronomer AWS CloudFormation AWS Glue Data Engineering Data Management Data Science DevOps ETL/ELT GitHub +7 more

Summary

The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation

Interview

Introduction How did you get involved in the area of data management? What was your initial project requirement?

What tooling did you consider in addition to Airflow? What aspects of the Airflow platform led you to choose it as your implementation target?

Can you describe your current deployment architecture?

How many engineers are involved in writing tasks for your Airflow installation?

What resources were the most helpful while learning about Airflow design patterns?

How have you architected your DAGs for deployment and extensibility?

What kinds of tests and automation have you put in place to support the ongoing stability of your deployment? What are some of the dead-ends or other pitfalls that you encountered during the course of this project? What aspects of Airflow have you found to be lacking that you would like to see improved? What did you wish someone had told you before you started work on your Airflow installation?

If you were to start over would you make the same choice? If Airflow wasn’t available what would be your second choice?

What are your next steps for improvements and fixes?

Contact Info

@eronarn on Twitter Website eronarn on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Quantopian Harvard Brain Science Initiative DevOps Days Boston Google Maps API Cron ETL (Extract, Transform, Load) Azkaban Luigi AWS Glue Airflow Pachyderm

Podcast Interview

AirBnB Python YAML Ansible REST (Representational State Transfer) SAML (Security Assertion Markup Language) RBAC (Role-Based Access Control) Maxime Beauchemin

Medium Blog

Celery Dask

Podcast Interview

PostgreSQL

Podcast Interview

Redis Cloudformation Jupyter Notebook Qubole Astronomer

Podcast Interview

Gunicorn Kubernetes Airflow Improvement Proposals Python Enhancement Proposals (PEP)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41

2018-07-30 · Data Engineering Podcast Listen

podcast_episode

by Peter Lubell-Doughtie (Ona) , Tobias Macey

API Chef Data Collection Data Engineering Data Management DataOps Docker Druid DWH GitHub Kafka Superset +2 more

Summary

With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy

Interview

Introduction How did you get involved in the area of data management? What is Ona and how did the company get started?

What are some examples of the types of customers that you work with?

What types of data do you support in your collection platform? What are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users? Does your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization? What are some of the integration challenges that are unique to the types of data that get collected by mobile field workers? Can you describe the flow of the data from collection through to analysis? To help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?

What are the architectural considerations that you factored in when designing it? What have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?

What are your plans for the future of Ona and Canopy?

Contact Info

Email pld on Github Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

OpenSRP Ona Canopy Open Data Kit Earth Institute at Columbia University Sustainable Engineering Lab WHO Bill and Melinda Gates Foundation XLSForms PostGIS Kafka Druid Superset Postgres Ansible Docker Terraform

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17

2018-02-04 · Data Engineering Podcast Listen

podcast_episode

by Matteo Merli , Rajan Dhabalia , Tobias Macey

AI/ML API Data Engineering Data Management Data Science GitHub Kafka Linux Pub/Sub

Summary

One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both options, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers A few announcements:

There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.

Your host is Tobias Macey and today I’m interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Pulsar is and what the original inspiration for the project was? What have been some of the most challenging aspects of building and promoting Pulsar? For someone who wants to run Pulsar, what are the infrastructure and network requirements that they should be considering and what is involved in deploying the various components? What are the scaling factors for Pulsar and what aspects of deployment and administration should users pay special attention to? What projects or services do you consider to be competitors to Pulsar and what makes it stand out in comparison? The documentation mentions that there is an API layer that provides drop-in compatibility with Kafka. Does that extend to also supporting some of the plugins that have developed on top of Kafka? One of the popular aspects of Kafka is the persistence of the message log, so I’m curious how Pulsar manages long-term storage and reprocessing of messages that have already been acknowledged? When is Pulsar the wrong tool to use? What are some of the improvements or new features that you have planned for the future of Pulsar?

Contact Info

Matteo

merlimat on GitHub @merlimat on Twitter

Rajan

@dhabaliaraj on Twitter rhabalia on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Pulsar Publish-Subscribe Yahoo Streamlio ActiveMQ Kafka Bookkeeper SLA (Service Level Agreement) Write-Ahead Log Ansible Zookeeper Pulsar Deployme

talk-data.com

Activity Trend

Top Events

Top Speakers

Continue evolving open source in the cloud with Red Hat and Microsoft

Fabric-X Deployment Strategies

Fabric-X Deployment Strategies: Ansible and Kubernetes

Fabric-X Endorsement Phase Deep Dive

Fabric-X Endorsement Phase Deep Dive

Data Engineering for Cybersecurity

Migrating and optimizing workloads on Google Cloud

Technical Deep Dive: Ansible & Terraform - PB&J of EnterpriseIT Automation (Current & Future)

Continuous Network Compliance (compliance-as-code): How organizations should leverage event-driven Ansible (EDA) to respond to changing configurations and compliance in real time

Real world use cases: Ansible Automation Platform customer panel discussion

Accelerate automation with generative AI and Ansible's enhanced developer experience

Real world story of moving to Managed Ansible in the Cloud from self-managed AAP

IBM PowerVC Version 2.0 Introduction and Configuration

Deploying SAP Software in Red Hat OpenShift on IBM Power Systems

Data Infrastructure Automation For Private SaaS At Snowplow

Ceph: Designing and Implementing Scalable Storage Systems

Managing Database Access Control For Teams With strongDM

Putting Airflow Into Production With James Meickle - Episode 43

Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41

Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17