Docker

Data Science Platform at PlayStation and Apache Airflow

2022-07-01 · Airflow Summit 2022

session

by Hamed Saljooghinejad , Siraj Malik

Airflow Data Science Kubernetes

In this talk, we explain how Apache Airflow is at the center of our Kubernetes-based Data Science Platform at PlayStation. We talk about how we built a flexible development environment for Data Scientists to interact with Apache Airflow and explain the tools and processes we built to help Data Scientists promote their dags from development to production. We will also talk about the impact of containerization and the usage of KubernetesOperator and the new SparkKubernetesOperator and the benefits of deploying Airflow in Kubernetes using the KubernetesExecutor across multiple environments.

From Academia to Data Analytics and Engineering - Gloria Quiceno

2022-05-20 · DataTalks.Club Listen

podcast_episode

by Gloria Quiceno

Analytics Cloud Computing Data Analytics Data Engineering Data Science GitHub HTML MATLAB MLOps Python SQL

We talked about:

Gloria’s background Working with MATLAB, R, C, Python, and SQL Working at ICE Job hunting after the bootcamp Data engineering vs Data science Using Docker Keeping track of job applications, employers and questions Challenges during the job search and transition Concerns over data privacy Challenges with salary negotiation The importance of career coaching and support Skills learned at Spiced Retrospective on Gloria’s transition to data and advice Top skills that helped Gloria get the job Thoughts on cloud platforms Thoughts on bootcamps and courses Spiced graduation project Standing out in a sea of applicants The cohorts at Spiced Conclusion

Links:

LinkedIn: https://www.linkedin.com/in/gloria-quiceno/ Github: https://github.com/gdq12

MLOps Zoomcamp: https://github.com/DataTalksClub/mlops-zoomcamp

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

Logging in Action

2022-04-10 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Phil Wilkins

Cloud Computing IoT JSON Kubernetes MongoDB data data-engineering elastic-stack-elk-stack elastic stack (elk stack) elasticsearch search

Make log processing a real asset to your organization with powerful and free open source tools. In Logging in Action you will learn how to: Deploy Fluentd and Fluent Bit into traditional on-premises, IoT, hybrid, cloud, and multi-cloud environments, both small and hyperscaled Configure Fluentd and Fluent Bit to solve common log management problems Use Fluentd within Kubernetes and Docker services Connect a custom log source or destination with Fluentd’s extensible plugin framework Logging best practices and common pitfalls Logging in Action is a guide to optimize and organize logging using the CNCF Fluentd and Fluent Bit projects. You’ll use the powerful log management tool Fluentd to solve common log management, and learn how proper log management can improve performance and make management of software and infrastructure solutions easier. Through useful examples like sending log-driven events to Slack, you’ll get hands-on experience applying structure to your unstructured data. About the Technology Don’t fly blind! An effective logging system can help you see and correct problems before they cripple your software. With the Fluentd log management tool, it’s a snap to monitor the behavior and health of your software and infrastructure in real time. Designed to collect and process log data from multiple sources using the industry-standard JSON format, Fluentd delivers a truly unified logging layer across all your systems. About the Book Logging in Action teaches you to record and analyze application and infrastructure data using Fluentd. Using clear, relevant examples, it shows you exactly how to transform raw system data into a unified stream of actionable information. You’ll discover how logging configuration impacts the way your system functions and set up Fluentd to handle data from legacy IT environments, local data centers, and massive Kubernetes-driven distributed systems. You’ll even learn how to implement complex log parsing with RegEx and output events to MongoDB and Slack. What's Inside Capture log events from a wide range of systems and software, including Kubernetes and Docker Connect to custom log sources and destinations Employ Fluentd’s extensible plugin framework Create a custom plugin for niche problems About the Reader For developers, architects, and operations professionals familiar with the basics of monitoring and logging. About the Author Phil Wilkins has spent over 30 years in the software industry. Has worked for small startups through to international brands. Quotes I highly recommend using Logging in Action as a getting-started guide, a refresher, or as a way to optimize your logging journey. - From the Foreword by Anurag Gupta, Fluent maintainer and Cofounder, Calyptia Covers everything you need if you want to implement a logging system using open source technology such as Fluentd and Kubernetes. - Alex Saez, Naranja X A great exploration of the features and capabilities of Fluentd, along with very useful hands-on exercises. - George Thomas, Manhattan Associates A practical holistic guide to integrating logging into your enterprise architecture. - Satej Sahu, Honeywell

Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications

2022-03-22 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Scott Haines (Databricks)

AI/ML Airflow Data Contracts Data Engineering Kafka Kubernetes MySQL Redis S3 Spark SQL Data Streaming +3 more

Leverage Apache Spark within a modern data engineering ecosystem. This hands-on guide will teach you how to write fully functional applications, follow industry best practices, and learn the rationale behind these decisions. With Apache Spark as the foundation, you will follow a step-by-step journey beginning with the basics of data ingestion, processing, and transformation, and ending up with an entire local data platform running Apache Spark, Apache Zeppelin, Apache Kafka, Redis, MySQL, Minio (S3), and Apache Airflow. Apache Spark applications solve a wide range of data problems from traditional data loading and processing to rich SQL-based analysis as well as complex machine learning workloads and even near real-time processing of streaming data. Spark fits well as a central foundation for any data engineering workload. This book will teach you to write interactive Spark applications using Apache Zeppelin notebooks, write and compilereusable applications and modules, and fully test both batch and streaming. You will also learn to containerize your applications using Docker and run and deploy your Spark applications using a variety of tools such as Apache Airflow, Docker and Kubernetes. Reading this book will empower you to take advantage of Apache Spark to optimize your data pipelines and teach you to craft modular and testable Spark applications. You will create and deploy mission-critical streaming spark applications in a low-stress environment that paves the way for your own path to production. What You Will Learn Simplify data transformation with Spark Pipelines and Spark SQL Bridge data engineering with machine learning Architect modular data pipeline applications Build reusable application components and libraries Containerize your Spark applications for consistency and reliability Use Docker and Kubernetes to deploy your Spark applications Speed up application experimentation using Apache Zeppelin and Docker Understand serializable structured data and data contracts Harness effective strategies for optimizing data in your data lakes Build end-to-end Spark structured streaming applications using Redis and Apache Kafka Embrace testing for your batch and streaming applications Deploy and monitor your Spark applications Who This Book Is For Professional software engineers who want to take their current skills and apply them to new and exciting opportunities within the data ecosystem, practicing data engineers who are looking for a guiding light while traversing the many challenges of moving from batch to streaming modes, data architects who wish to provide clear and concise direction for how best to harness anduse Apache Spark within their organization, and those interested in the ins and outs of becoming a modern data engineer in today's fast-paced and data-hungry world

Cassandra: The Definitive Guide, (Revised) Third Edition, 3rd Edition

2022-01-23 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Eben Hewitt , Jeff Carpenter

Cassandra Cloud Computing Data Modelling ELK Kafka Kubernetes Spark data data-engineering nosql-databases

Imagine what you could do if scalability wasn't a problem. With this hands-on guide, you'll learn how the Cassandra database management system handles hundreds of terabytes of data while remaining highly available across multiple data centers. This revised third edition--updated for Cassandra 4.0 and new developments in the Cassandra ecosystem, including deployments in Kubernetes with K8ssandra--provides technical details and practical examples to help you put this database to work in a production environment. Authors Jeff Carpenter and Eben Hewitt demonstrate the advantages of Cassandra's nonrelational design, with special attention to data modeling. Developers, DBAs, and application architects looking to solve a database scaling issue or future-proof an application will learn how to harness Cassandra's speed and flexibility. Understand Cassandra's distributed and decentralized structure Use the Cassandra Query Language (CQL) and cqlsh (the CQL shell) Create a working data model and compare it with an equivalent relational model Design and develop applications using client drivers Explore cluster topology and learn how nodes exchange data Maintain a high level of performance in your cluster Deploy Cassandra onsite, in the cloud, or with Docker and Kubernetes Integrate Cassandra with Spark, Kafka, Elasticsearch, Solr, and Lucene

Al and Kim Smith discuss IBM Consulting around Containerization and technology trends. Plus the 10 women in Cloud and game changing women leaders

2022-01-05 · Making Data Simple Listen

podcast_episode

by Kim Smith (IBM) , Al Martin (IBM)

Cloud Computing IBM

Send us a text Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Abstract Making Data Simple Podcast is hosted by Al Martin, VP, IBM Expert Services Delivery, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun. This week on Making Data Simple, we have Kim Smith. Kim is Global Vice President, Hybrid Cloud Services Consulting at IBM. Kim is author, UNCTAD speaker, executive board member, top 10 women in cloud, 10 ten game changing female leaders. Show Notes 1:25 – Kim’s experience 5:44 – What did you code in? 8:56 – How do you continue to reinvent yourself? 11:29 – What have you done to drive value? 14:02 – Describe your roll at IBM 18:18 – What does IBM offer in Containerization? 24:54 – What use cases are you working on? 28:10 – How does the engagement work? 35:48 – What are the top technology trends going to be? 42:53 – Say more on the top 10 women in Cloud and the top 10 game changing female leaders Connect with the Team Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Data Science at the Command Line, 2nd Edition

2021-08-17 · O'Reilly Data Science Books O'Reilly Amazon

book

by Jeroen Janssens

Agile/Scrum API CSV Data Science HTML JSON Linux Python Spark Unix XML data +1 more

This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 100 Unix power tools--useful whether you work with Windows, macOS, or Linux. You'll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you're comfortable processing data with Python or R, you'll learn how to greatly improve your data science workflow by leveraging the command line's power. This book is ideal for data scientists, analysts, engineers, system administrators, and researchers. Obtain data from websites, APIs, databases, and spreadsheets Perform scrub operations on text, CSV, HTML, XML, and JSON files Explore data, compute descriptive statistics, and create visualizations Manage your data science workflow Create your own tools from one-liners and existing Python or R code Parallelize and distribute data-intensive pipelines Model data with dimensionality reduction, regression, and classification algorithms Leverage the command line from Python, Jupyter, R, RStudio, and Apache Spark

Build Your Own Data Pipeline - Andreas Kretz

2021-07-02 · DataTalks.Club Listen

podcast_episode

by Andreas Kretz

Cloud Computing Data Engineering Data Science DevOps Hadoop HTML

We talked about:

Andreas’s background Why data engineering is becoming more popular Who to hire first – a data engineer or a data scientist? How can I, as a data scientist, learn to build pipelines? Don’t use too many tools What is a data pipeline and why do we need it? What is ingestion? Can just one person build a data pipeline? Approaches to building data pipelines for data scientists Processing frameworks Common setup for data pipelines — car price prediction Productionizing the model with the help of a data pipeline Scheduling Orchestration Start simple Learning DevOps to implement data pipelines How to choose the right tool Are Hadoop, Docker, Cloud necessary for a first job/internship? Is Hadoop still relevant or necessary? Data engineering academy How to pick up Cloud skills Avoid huge datasets when learning Convincing your employer to do data science How to find Andreas

Links:

LinkedIn: https://www.linkedin.com/in/andreas-kretz Data engieering cookbook: https://cookbook.learndataengineering.com/ Course: https://learndataengineering.com/

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

Airflow loves Kubernetes

2021-07-01 · Airflow Summit 2021

session

by Jarek Potiuk (Apache Software Foundation) , Kaxil Naik

Airflow Kubernetes

In this talk Jarek and Kaxil will talk about official, community support for running Airflow in the Kubernetes environment. The full support for Kubernetes deployments was developed by the community for quite a while and in the past users of Airflow had to rely on 3rd-party images and helm-charts to run Airflow on Kubernetes. Over the last year community members made an enormous effort to provide robust, simple and versatile support for those deployments that would respond to all kinds of Airflow users. Starting from official container image, through quick-start docker-compose configuration, culminating in April with release of the official Helm Chart for Airflow. This talk is aimed for Airflow users who would like to make use of all the effort. The users will learn how to: Extend or customize Airflow Official Docker Image to adapt it to their needs Run quickstart docker-compose environment where they can quickly verify their images Configure and deploy Airflow on Kubernetes using the Official Airflow Helm chart

Lessons Learned From The Pipeline Data Engineering Academy

2021-06-26 · Data Engineering Podcast Listen

podcast_episode

by Daniel Molnar , Peter Fabian (Pipeline Academy) , Tobias Macey

BI Data Engineering Data Management DWH ETL/ELT GitHub Kafka Kubernetes Looker Modern Data Stack Pandas Prefect +3 more

Summary Data Engineering is a broad and constantly evolving topic, which makes it difficult to teach in a concise and effective manner. Despite that, Daniel Molnar and Peter Fabian started the Pipeline Academy to do exactly that. In this episode they reflect on the lessons that they learned while teaching the first cohort of their bootcamp how to be effective data engineers. By focusing on the fundamentals, and making everyone write code, they were able to build confidence and impart the importance of context for their students.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Your host is Tobias Macey and today I’m interviewing Daniel Molnar and Peter Fabian about the lessons that they learned from their first cohort at the Pipeline data engineering academy

Interview

Introduction How did you get involved in the area of data management? Can you start by sharing the curriculum and learning goals for the students? How did you set a common baseline for all of the students to build from throughout the program?

What was your process for determining the structure of the tasks and the tooling used?

What were some of the topics/tools that the students had the most difficulty with?

What topics/tools were the easiest to grasp?

What are some difficulties that you encountered while trying to teach different concepts? How did you deal with the tension of teaching the fundamentals while tying them to toolchains that hiring managers are looking for? What are the successes that you had with this cohort and what changes are you making to your approach/curriculum to build on them? What are some of the failures that you encountered and what lessons have you taken from them? How did the pandemic impact your overall plan and execution of the initial cohort? What were the skills that you focused on for interview preparation? What level of ongoing support/engagement do you have with students once they complete the curriculum? What are the most interesting, innovative, or unexpected solutions that you saw from your students? What are the most interesting, unexpected, or challenging lessons that you have learned while working with your first cohort? When is a bootcamp the wrong approach for skill development? What do you have planned for the future of the Pipeline Academy?

Contact Info

Daniel

LinkedIn Website @soobrosa on Twitter

Peter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Pipeline Academy

Blog

Scikit Pandas Urchin Kafka Three "C"s – Context, Confidence, and Code Prefect

Podcast Episode

Great Expectations

Podcast Episode Podcast.init Episode

Docker Kubernetes Become a Data Engineer On A Shoestring James Mickens

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Self Service Open Source Data Integration With AirByte

2021-02-23 · Data Engineering Podcast Listen

podcast_episode

by Michel Tricot (Airbyte) , John Lafleur (Airbyte) , Tobias Macey

Airbyte Airflow Avro BI BigQuery CI/CD Cloud Computing Dagster Data Engineering Data Management Data Quality Datacoral +21 more

Summary Data integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed platforms available, but the list of options for an open source system that supports a large variety of sources and destinations is still embarrasingly short. The team at Airbyte is adding a new entry to that list with the goal of making robust and easy to use data integration more accessible to teams who want or need to maintain full control of their data. In this episode co-founders John Lafleur and Michel Tricot share the story of how and why they created Airbyte, discuss the project’s design and architecture, and explain their vision of what an open soure data integration platform should offer. If you are struggling to maintain your extract and load pipelines or spending time on integrating with a new system when you would prefer to be working on other projects then this is definitely a conversation worth listening to.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask. RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. Your host is Tobias Macey and today I’m interviewing Michel Tricot and John Lafleur about Airbyte, an open source framework for building data integration pipelines.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Airbyte is and the story behind it? Businesses and data engineers have a variety of options for how to manage their data integration. How would you characterize the overall landscape and how does Airbyte distinguish itself in that space? How would you characterize your target users?

How have those personas instructed the priorities and design of Airbyte? What do you see as the benefits and tradeoffs of a UI oriented data integration platform as compared to a code first approach?

what are the complex/challenging elements of data integration that makes it such a slippery problem? motivation for creating open source ELT as a business Can you describe how the Airbyte platform is implemented?

What was your motivation for choosing Java as the primary language?

incidental complexity of forcing all connectors to be packaged as containers shortcomings of the Singer specification/motivation for creating a backwards incompatible interface perceived potential for community adoption of Airbyte specification tradeoffs of using JSON as interchange format vs. e.g. protobuf/gRPC/Avro/etc.

information lost when converting records to JSON types/how to preserve that information (e.g. field constraints, valid enums, etc.)

interfaces/extension points for integrating with other tools, e.g. Dagster abstraction layers for simplifying implementation of new connectors tradeoffs of storing all connectors in a monorepo with the Airbyte core

impact of community adoption/contributions

What is involved in setting up an Airbyte installation? What are the available axes for scaling an Airbyte deployment? challenges of setting up and maintaining CI environment for Airbyte How are you managing governance and long term sustainability of the project? What are some of the most interesting, unexpected, or innovative ways that you have seen Airbyte used? What are the most interesting, unexpected, or challenging lessons that you have learned while building Airbyte? When is Airbyte the wrong choice? What do you have planned for the future of the project?

Contact Info

Michel

LinkedIn @MichelTricot on Twitter michel-tricot on GitHub

John

LinkedIn @JeanLafleur on Twitter johnlafleur on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Airbyte Liveramp Fivetran

Podcast Episode

Stitch Data Matillion DataCoral

Podcast Episode

Singer Meltano

Podcast Episode

Airflow

Podcast.init Episode

Kotlin Docker Monorepo Airbyte Specification Great Expectations

Podcast Episode

Dagster

Data Engineering Podcast Episode Podcast.init Episode

Prefect

Podcast Episode

DBT

Podcast Episode

Kubernetes Snowflake

Podcast Episode

Redshift Presto Spark Parquet

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Applied Data Science Using PySpark: Learn the End-to-End Predictive Model-Building Cycle

2020-12-17 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Ramcharan Kakarla , Sridhar Alla , Sundar Krishnan

AI/ML API Data Science PySpark apache-spark data data-engineering

Discover the capabilities of PySpark and its application in the realm of data science. This comprehensive guide with hand-picked examples of daily use cases will walk you through the end-to-end predictive model-building cycle with the latest techniques and tricks of the trade. Applied Data Science Using PySpark is divided unto six sections which walk you through the book. In section 1, you start with the basics of PySpark focusing on data manipulation. We make you comfortable with the language and then build upon it to introduce you to the mathematical functions available off the shelf. In section 2, you will dive into the art of variable selection where we demonstrate various selection techniques available in PySpark. In section 3, we take you on a journey through machine learning algorithms, implementations, and fine-tuning techniques. We will also talk about different validation metrics and how to use them for picking the best models. Sections 4 and 5 go through machine learning pipelines and various methods available to operationalize the model and serve it through Docker/an API. In the final section, you will cover reusable objects for easy experimentation and learn some tricks that can help you optimize your programs and machine learning pipelines. By the end of this book, you will have seen the flexibility and advantages of PySpark in data science applications. This book is recommended to those who want to unleash the power of parallel computing by simultaneously working with big datasets. What You Will Learn Build an end-to-end predictive model Implement multiple variable selection techniques Operationalize models Master multiple algorithms and implementations Who This Book is For Data scientists and machine learning and deep learning engineers who want to learn and use PySpark for real-time analysis of streamingdata.

MongoDB Topology Design: Scalability, Security, and Compliance on a Global Scale

2020-09-10 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Nicholas Cottrell

Cloud Computing DevOps GDPR/CCPA Kubernetes MongoDB Cyber Security data data-engineering nosql-databases

Create a world-class MongoDB cluster that is scalable, reliable, and secure. Comply with mission-critical regulatory regimes such as the European Union’s General Data Protection Regulation (GDPR). Whether you are thinking of migrating to MongoDB or need to meet legal requirements for an existing self-managed cluster, this book has you covered. It begins with the basics of replication and sharding, and quickly scales up to cover everything you need to know to control your data and keep it safe from unexpected data loss or downtime. This book covers best practices for stable MongoDB deployments. For example, a well-designed MongoDB cluster should have no single point of failure. The book covers common use cases when only one or two data centers are available. It goes into detail about creating geopolitical sharding configurations to cover the most stringent data protection regulation compliance. The book also covers different tools and approaches for automating and monitoring a cluster with Kubernetes, Docker, and popular cloud provider containers. What You Will Learn Get started with the basics of MongoDB clusters Protect and monitor a MongoDB deployment Deepen your expertise around replication and sharding Keep effective backups and plan ahead for disaster recovery Recognize and avoid problems that can occur in distributed databases Build optimal MongoDB deployments within hardware and data center limitations Who This Book Is For Solutions architects, DevOps architects and engineers, automation and cloud engineers, and database administrators who are new to MongoDB and distributed databases or who need to scale up simple deployments. This book is a complete guide to planning a deployment for optimal resilience, performance, and scaling, and covers all the details required to meet the new set of data protection regulations such as the GDPR. This book is particularly relevant for large global organizations such as financial and medical institutions, as well as government departments that need to control data in the whole stack and are prohibited from using managed cloud services.

Airflow: A beast character in the gaming world

2020-07-01 · Airflow Summit 2020

session

by Naresh Yegireddi (PlayStation) , Patricio Garza (PlayStation)

Airflow Analytics AWS Amazon EC2 Big Data Data Analytics ETL/ELT Python Spark

Being a pioneer for the past 25 years, SONY PlayStation has played a vital role in the Interactive Gaming Industry. Over 100+ million monthly active users, 100+ million PS-4 console sales along with thousands of game development partners across the globe, big-data problem is quite inevitable. This presentation talks about how we scaled Airflow horizontally which has helped us building a stable, scalable and optimal data processing infrastructure powered by Apache Spark, AWS ECS, EC2 and Docker. Due to the demand for processing large volumes of data and also to meet the growing Organization’s data analytics and usage demands, the data team at PlayStation took an initiative to build an open source big data processing infrastructure where Apache Spark in Python as the core ETL engine. Apache Airflow is the core workflow management tool for the entire eco system. We started with an Airflow application running on a single AWS EC2 instance to support parallelism of 16 with 1 scheduler and 1 worker and eventually scaled it to a bigger scheduler along with 4 workers to support a parallelism of 96, DAG concurrency of 96 and a worker task concurrency of 24. Containerized all the services on AWS ECS which gave us an ability to scale Airflow horizontally.

Airflow as an elastic ETL tool

2020-07-01 · Airflow Summit 2020

session

by Vicente Rubén Del Pino Ruiz (UnitedHealth Group) , Hendrik Kleine (Optum)

AI/ML Airflow ELK ETL/ELT SQL

In search of a better, modern, simplistic method of managing ETL’s processes and merging them with various AI and ML tasks, we landed on Airflow. We envisioned a new user friendly interface that can leverage dynamic DAG’s and reusable components to build an ETL tool that requires virtually no training. We built several template DAG’s and connectors for Airflow to typical data sources, like SQL Server. Then proceeded to build a modern interface on top that brings ETL build, scheduling and execution capabilities. Acknowledging Airflow is designed for task orchestration, we expanded our infrastructure to use K8 and Docker for elastic computing. Key to our solution is the ability to create ETL’s using only open source tools, whilst executing on-par or faster than commercial solutions and an interface so simple that ETL’s could be created in seconds.

Production Docker image for Apache Airflow

2020-07-01 · Airflow Summit 2020

session

by Jarek Potiuk (Apache Software Foundation)

Airflow

This talk will guide you trough internals of the official Production Docker Image of Airflow. It will show you the foreseen use cases for it and how to use it in conjunction with the Official Helm Chart to make your own deployments.

Introducing Microsoft SQL Server 2019

2020-04-27 · O'Reilly SQL Books O'Reilly Amazon

book

by James Rowland-Jones , Mitchell Pearson , Arun Sirpal , Dave Noderer , Dustin Ryan , Kellyn Gorman , Buck Woody , Allan Hirt

Analytics Azure BI Big Data Cloud Computing Data Management Hadoop HDFS Kubernetes Microsoft NoSQL Power BI +4 more

Introducing Microsoft SQL Server 2019 is the must-have guide for database professionals eager to leverage the latest advancements in SQL Server 2019. This book covers the features and capabilities that make SQL Server 2019 a powerful tool for managing and analyzing data both on-premises and in the cloud. What this Book will help me do Understand the new features introduced in SQL Server 2019 and their practical applications. Confidently manage and analyze relational, NoSQL, and big data within SQL Server 2019. Implement containerization for SQL Server using Docker and Kubernetes. Migrate and integrate your databases effectively to use Power BI Report Server. Query data from Hadoop Distributed File System with Azure Data Studio. Author(s) The authors of 'Introducing Microsoft SQL Server 2019' are subject matter experts including Kellyn Gorman, Allan Hirt, and others. With years of professional experience in database management and SQL Server, they bring a wealth of practical insight and knowledge to the book. Their experience spans roles as administrators, architects, and educators in the field. Who is it for? This book is aimed at database professionals such as DBAs, architects, and big data engineers who are currently using earlier versions of SQL Server or other database platforms. It is particularly well-suited for professionals aiming to understand and implement SQL Server 2019's new features. Readers should have basic familiarity with SQL Server and RDBMS concepts. If you're looking to explore SQL Server 2019 to improve data management and analytics in your organization, this book is for you.

Solving Data Lineage Tracking And Data Discovery At WeWork

2019-12-16 · Data Engineering Podcast Listen

podcast_episode

by Willy Lulciuc (WeWork) , Julien Le Dem (Astronomer) , Tobias Macey

AI/ML Airflow Analytics Big Data Dagster Data Engineering Data Management Data Modelling Data Quality Google Dataform dbt Delta +18 more

Summary Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email [email protected] with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Marquez is?

What was missing in existing metadata management platforms that necessitated the creation of Marquez?

How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?

How does it compare to the Amundsen platform that Lyft recently released?

What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see? What are some of the capabilities that are unique to Marquez and how are you using them at WeWork? What are the primary resource types that you support in Marquez?

What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?

Can you explain how Marquez is architected and how the design has evolved since you first began working on it?

Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?

What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?

How is the metadata itself stored and managed in Marquez?

How much up-front data modeling is necessary and what types of schema representations are supported?

Can you talk through the overall workflow of someone using Marquez in their environment?

What is involved in registering and updating datasets? How do you define and track the health of a given dataset? What are some of the interesting questions that can be answered from the information stored in Marquez?

What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases? For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it? What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform? When is Marquez the wrong choice for a metadata repository? What do you have planned for the future of Marquez?

Contact Info

Julien Le Dem

@J_ on Twitter Email julienledem on GitHub

Willy

LinkedIn @wslulciuc on Twitter wslulciuc on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Marquez

DataEngConf Presentation

WeWork Canary Yahoo Dremio Hadoop Pig Parquet

Podcast Episode

Airflow Apache Atlas Amundsen

Podcast Episode

Uber DataBook LinkedIn DataHub Iceberg Table Format

Podcast Episode

Delta Lake

Podcast Episode

Great Expectations data pipeline unit testing framework

Podcast.init Episode

Redshift SnowflakeDB

Podcast Episode

Apache Kafka Schema Registry

Podcast Episode

Open Tracing Jaeger Zipkin DropWizard Java framework Marquez UI Cayley Graph Database Kubernetes Marquez Helm Chart Marquez Docker Container Dagster

Podcast Episode

Luigi DBT

Podcast Episode

Thrift Protocol Buffers

The intro and outro music is from a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug?utm_source=rss&utm_medium=rss"…

Mastering SQL Server 2017

2019-08-22 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Christian Cote , Milos Radivojevic , William Durkin , Dejan Sarka , Matija Lah

AI/ML Azure BI DWH ETL/ELT JSON Linux Microsoft Python SQL SQL Server SSIS +4 more

Leverage the power of SQL Server 2017 Integration Services to build data integration solutions with ease Key Features Work with temporal tables to access information stored in a table at any time Get familiar with the latest features in SQL Server 2017 Integration Services Program and extend your packages to enhance their functionality Book Description Microsoft SQL Server 2017 uses the power of R and Python for machine learning and containerization-based deployment on Windows and Linux. By learning how to use the features of SQL Server 2017 effectively, you can build scalable apps and easily perform data integration and transformation. You'll start by brushing up on the features of SQL Server 2017. This Learning Path will then demonstrate how you can use Query Store, columnstore indexes, and In-Memory OLTP in your apps. You'll also learn to integrate Python code in SQL Server and graph database implementations for development and testing. Next, you'll get up to speed with designing and building SQL Server Integration Services (SSIS) data warehouse packages using SQL server data tools. Toward the concluding chapters, you'll discover how to develop SSIS packages designed to maintain a data warehouse using the data flow and other control flow tasks. By the end of this Learning Path, you'll be equipped with the skills you need to design efficient, high-performance database applications with confidence. This Learning Path includes content from the following Packt books: SQL Server 2017 Developer's Guide by Milos Radivojevic, Dejan Sarka, et. al SQL Server 2017 Integration Services Cookbook by Christian Cote, Dejan Sarka, et. al What you will learn Use columnstore indexes to make storage and performance improvements Extend database design solutions using temporal tables Exchange JSON data between applications and SQL Server Migrate historical data to Microsoft Azure by using Stretch Database Design the architecture of a modern Extract, Transform, and Load (ETL) solution Implement ETL solutions using Integration Services for both on-premise and Azure data Who this book is for This Learning Path is for database developers and solution architects looking to develop ETL solutions with SSIS, and explore the new features in SSIS 2017. Advanced analysis practitioners, business intelligence developers, and database consultants dealing with performance tuning will also find this book useful. Basic understanding of database concepts and T-SQL is required to get the best out of this Learning Path.

Deploying a Database Instance in an IBM Cloud Private Cluster on IBM Z

2019-07-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Christian May

Cloud Computing IBM Kubernetes MariaDB Virtual Machine data data-engineering

This IBM® Redpaper™ publication shows you how to deploy a database instance within a container using an IBM Cloud™ Private cluster on IBM Z®. A preinstalled IBM Spectrum™ Scale 5.0.3 cluster file system provides back-end storage for the persistent volumes bound to the database. A container is a standard unit of software that packages code and all its dependencies, so the application runs quickly and reliably from one computing environment to another. By default, containers are ephemeral. However, stateful applications, such as databases, require some type of persistent storage that can survive service restarts or container crashes. IBM provides several products helping organizations build an environment on an IBM Z infrastructure to develop and manage containerized applications, including dynamic provisioning of persistent volumes. As an example for a stateful application, this paper describes how to deploy the relational database MariaDB using a Helm chart. The IBM Spectrum Scale V5.0.3 cluster file system is providing back-end storage for the persistent volumes. This document provides step-by-step guidance regarding how to install and configure the following components: IBM Cloud Private 3.1.2 (including Kubernetes) Docker 18.03.1-ce IBM Storage Enabler for Containers 2.0.0 and 2.1.0 This Redpaper demonstrates how we set up the example for a stateful application in our lab. The paper gives you insights about planning for your implementation. IBM Z server hardware, the IBM Z hypervisor z/VM®, and the IBM Spectrum Scale cluster file system are prerequisites to set up the example environment. The Redpaper is written with the assumption that you have familiarity with and basic knowledge of the software products used in setting up the environment. The intended audience includes the following roles: Storage administrators IT/Cloud administrators Technologists IT specialists

talk-data.com

Activity Trend

Top Events

Top Speakers

Data Science Platform at PlayStation and Apache Airflow

From Academia to Data Analytics and Engineering - Gloria Quiceno

Logging in Action

Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications

Cassandra: The Definitive Guide, (Revised) Third Edition, 3rd Edition

Al and Kim Smith discuss IBM Consulting around Containerization and technology trends. Plus the 10 women in Cloud and game changing women leaders

Data Science at the Command Line, 2nd Edition

Build Your Own Data Pipeline - Andreas Kretz

Airflow loves Kubernetes

Lessons Learned From The Pipeline Data Engineering Academy

Self Service Open Source Data Integration With AirByte

Applied Data Science Using PySpark: Learn the End-to-End Predictive Model-Building Cycle

MongoDB Topology Design: Scalability, Security, and Compliance on a Global Scale

Airflow: A beast character in the gaming world

Airflow as an elastic ETL tool

Production Docker image for Apache Airflow

Introducing Microsoft SQL Server 2019

Solving Data Lineage Tracking And Data Discovery At WeWork

Mastering SQL Server 2017

Deploying a Database Instance in an IBM Cloud Private Cluster on IBM Z