JSON

Full Stack FastAPI, React, and MongoDB

2022-09-23 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Marko Aleksendrić

JavaScript MongoDB Python React Redis data data-engineering nosql-databases

Master web development with the FARM stack in this comprehensive guide. You'll learn to harness FastAPI for a secure and efficient backend, React for a dynamic frontend, and MongoDB for flexible data storage. Gain practical experience by building fully functional projects that you can deploy and fine-tune, opening doors to enhanced proficiency in modern web technologies. What this Book will help me do Build secure and performant backends using FastAPI and understand its integration with MongoDB. Develop responsive and dynamic user interfaces with React and incorporate server-side rendering for improved SEO. Explore the intricacies of deploying full-stack applications on platforms like Heroku and Netlify. Implement robust user authentication systems with JSON Web Tokens for securing your applications. Apply caching strategies with Redis to enhance the performance and scalability of applications. Author(s) Marko Aleksendrić, the author of this book, combines years of experience in software development with a passion for teaching. Specializing in full-stack web technologies, Marko has a track record of guiding developers in mastering modern tools like FastAPI and React. His practical approach focuses on equipping readers with real-world skills through projects and best practices. Who is it for? This book is ideal for developers with foundational knowledge in Python, JavaScript, and web basics who want to expand their expertise into full-stack development. Whether you're a professional seeking to enhance your project toolkit or a beginner aiming to tackle modern web applications, this guide provides a step-by-step approach tailored to your growth.

SQL for Data Analytics - Third Edition

2022-08-29 · O'Reilly SQL Books O'Reilly Amazon

book

by Upom Malik , Benjamin Johnston , Matt Goldwasser , Jun Shan

Analytics Data Analytics SQL

SQL for Data Analytics is an accessible guide to helping readers efficiently use SQL for data analytics tasks. You will learn the ins and outs of writing SQL queries, preparing datasets, and utilizing advanced features like geospatial data handling and window functions. Demystify the process of harnessing SQL to tackle analytical data challenges in a structured and hands-on way. What this Book will help me do Become proficient in preparing and managing datasets using SQL. Learn to write efficient SQL queries for summarizing and analyzing data. Master advanced SQL features, including window functions and JSON handling. Optimize SQL queries and automate analytical tasks for efficiency. Gain practical experience analyzing data with real-world scenarios. Author(s) The authors, Jun Shan, Matt Goldwasser, Upom Malik, and Benjamin Johnston, are experienced professionals in data analytics and database management. They bring a blend of technical expertise and practical insights to teaching SQL for analytics. Their collective knowledge ensures that the book caters to all levels, from foundational concepts to advanced techniques. Who is it for? This book is ideal for database engineers transitioning into analytics, backend engineers looking to deepen their understanding of production data, and data scientists or business analysts seeking to boost their SQL analytics skills. Readers should have a basic grasp of SQL and familiarity with statistics and linear algebra to fully benefit from the contents.

MySQL Cookbook, 4th Edition

2022-08-02 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Alkin Tezuysal , Sveta Smirnova

MySQL data data-engineering relational-databases

For MySQL, the price of popularity comes with a flood of questions from users on how to solve specific data-related issues. That's where this cookbook comes in. When you need quick solutions or techniques, this handy resource provides scores of short, focused pieces of code, hundreds of worked-out examples, and clear, concise explanations for programmers who don't have the time (or expertise) to resolve MySQL problems from scratch. In this updated fourth edition, authors Sveta Smirnova and Alkin Tezuysal provide more than 200 recipes that cover powerful features in both MySQL 5.7 and 8.0. Beginners as well as professional database and web developers will dive into topics such as MySQL Shell, MySQL replication, and working with JSON. You'll learn how to: Connect to a server, issue queries, and retrieve results Retrieve data from the MySQL Server Store, retrieve, and manipulate strings Work with dates and times Sort query results and generate summaries Assess the characteristics of a dataset Write stored functions and procedures Use stored routines, triggers, and scheduled events Perform basic MySQL administration tasks Understand MySQL monitoring fundamentals

Powering Up the Business with a Lakehouse

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

CI/CD Data Lakehouse Data Quality Databricks Delta GDPR/CCPA

Within Wehkamp we required a uniform way to provide reliable and on time data to the business, while making this access compliant with GDPR. Unlocking all the data sources that we have scattered across the company and democratize the data access was of the utmost importance, allowing us to empower the business with more, better and faster data.

Focusing on open source technologies, we've built a data platform almost from the ground up that focuses on 3 levels of data curation - bronze, silver and gold - which follows the LakeHouse Architecture. The ingestion into bronze is where the PII fields are pseudonymized, making the use of the data within the delta lake compliant and, since there is no visible user data, it means everyone can use the entire delta lake for exploration and new use cases. Naturally, specific teams are allowed to see some user data that is necessary for their use cases. Besides the standard architecture, we've developed a library that allows us to ingest new data sources by adding a JSON config file with the characteristics. This combined with the ACID transactions that delta provides and the efficient Structured Stream provided through Auto Loader has allowed a small team to maintain 100+ streams with insignificant downtime.

Some other components of this platform are the following: - Alerting to Slack - Data quality checks - CI/CD - Stream processing with the delta engine

The feedback so far has been encouraging, as more and more teams across the company are starting to use the new platform and taking advantage of all its perks. It is still a long time until we get to turn off some of the components of the old data platform, but it has come a long way.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Opening the Floodgates: Enabling Fast, Unmediated End User Access to Trillion-Row Datasets with SQL

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

Analytics API ClickHouse Databricks Druid Marketing SQL

Spreadsheets revolutionized IT by giving end users the ability to create their own analytics. Providing direct end user access to trillion-row datasets generated in financial markets or digital marketing is much harder. New SQL data warehouses like ClickHouse and Druid can provide fixed latency with constant cost on very large datasets, which opens up new possibilities.

Our talk walks through recent experience on analytic apps developed by ClickHouse users that enable end users like market traders to develop their own analytics directly off raw data. We’ll cover the following topics.

Characteristics of new open source column databases and how they enable low-latency analytics at constant cost.
Idiomatic ways to validate new apps by building MVPs that support a wide range of queries on source data including storing source JSON, schema design, applying compression on columns, and building indexes for needle-in-a-haystack queries.
Incrementally identifying hotspots and applying easy optimizations to bring query performance into line with long term latency and cost requirements.
Methods of building accessible interfaces, including traditional dashboards, imitating existing APIs that are already known, and creating app-specific visualizations.

We’ll finish by summarizing a few of the benefits we’ve observed and also touch on ways that analytic infrastructure could be improved to make end user access even more productive. The lessons are as general as possible so that they can be applied across a wide range of analytic systems, not just ClickHouse.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

UIMeta: A 10X Faster Cloud-Native Apache Spark History Server

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

Cloud Computing Databricks Spark

Spark history server is an essential tool for monitoring, analyzing and optimizing spark jobs.

The original history server is based on Spark event log mechanism. A running Spark job will produce many kinds of events that describe the job's status continuously. All the events are serialized into JSON and appended to a file —— event log. The history server has to replay the event log and rebuild the memory store needed for UI. In a cluster, the history server also needs to periodically scan the event log directory and cache all the files' metadata in memory.

Actually, an event log contains too much redundant info for a history server. A long-running application can bring a huge event log which may cost a lot to maintain and require a long time to replay. In large-scale production, the number of jobs can be large and leads to a heavy burden on history servers. It needs additional development to build a scalable history server service.

In this talk, we want to introduce a new history server based on UIMeta. UIMeta is a wrapper of the KVStore objects needed by a Spark UI. A job will bring a UIMeta log by stagedly serializing UIMeta. An UIMeta log is approximately 10x smaller in size and 10x faster in replaying compared to the original event log file. Benefitting from the good performance, we develop a new stateless history server without a directory scan. Currently, UIMeta Service has taken the place of the original history server and provided service for millions of jobs per day in Bytedance.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

What's new in Airflow 2.3?

2022-07-01 · Airflow Summit 2022

session

by Kaxil Naik

Airflow

This session will talk about the awesome new features the community has built that would be part of Airflow 2.3. Highlights: Dynamic Task Mapping DB. Downgrades Pruning old DB records Connections using JSON UI Improvements

Discover And De-Clutter Your Unstructured Data With Aparavi

2022-06-13 · Data Engineering Podcast Listen

podcast_episode

by Rod Christensen (Aparavi) , Tobias Macey

AWS Azure BigQuery CDP Cloud Computing Data Engineering Data Lake Data Management Databricks ETL/ELT GCP Java +12 more

Summary Unstructured data takes many forms in an organization. From a data engineering perspective that often means things like JSON files, audio or video recordings, images, etc. Another category of unstructured data that every business deals with is PDFs, Word documents, workstation backups, and countless other types of information. Aparavi was created to tame the sprawl of information across machines, datacenters, and clouds so that you can reduce the amount of duplicate data and save time and money on managing your data assets. In this episode Rod Christensen shares the story behind Aparavi and how you can use it to cut costs and gain value for the long tail of your unstructured data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer. Your host is Tobias Macey and today I’m interviewing Rod Christensen about Aparavi, a platform designed to find and unlock the value of data, no matter where it lives

Interview

Introduction How did you get involved in the area of data management? Can you describe what Aparavi is and the story behind it? Who are the target customers for Aparavi and how does that inform your product roadmap and messaging? What are some of th

Logging in Action

2022-04-10 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Phil Wilkins

Cloud Computing Docker IoT Kubernetes MongoDB data data-engineering elastic-stack-elk-stack elastic stack (elk stack) elasticsearch search

Make log processing a real asset to your organization with powerful and free open source tools. In Logging in Action you will learn how to: Deploy Fluentd and Fluent Bit into traditional on-premises, IoT, hybrid, cloud, and multi-cloud environments, both small and hyperscaled Configure Fluentd and Fluent Bit to solve common log management problems Use Fluentd within Kubernetes and Docker services Connect a custom log source or destination with Fluentd’s extensible plugin framework Logging best practices and common pitfalls Logging in Action is a guide to optimize and organize logging using the CNCF Fluentd and Fluent Bit projects. You’ll use the powerful log management tool Fluentd to solve common log management, and learn how proper log management can improve performance and make management of software and infrastructure solutions easier. Through useful examples like sending log-driven events to Slack, you’ll get hands-on experience applying structure to your unstructured data. About the Technology Don’t fly blind! An effective logging system can help you see and correct problems before they cripple your software. With the Fluentd log management tool, it’s a snap to monitor the behavior and health of your software and infrastructure in real time. Designed to collect and process log data from multiple sources using the industry-standard JSON format, Fluentd delivers a truly unified logging layer across all your systems. About the Book Logging in Action teaches you to record and analyze application and infrastructure data using Fluentd. Using clear, relevant examples, it shows you exactly how to transform raw system data into a unified stream of actionable information. You’ll discover how logging configuration impacts the way your system functions and set up Fluentd to handle data from legacy IT environments, local data centers, and massive Kubernetes-driven distributed systems. You’ll even learn how to implement complex log parsing with RegEx and output events to MongoDB and Slack. What's Inside Capture log events from a wide range of systems and software, including Kubernetes and Docker Connect to custom log sources and destinations Employ Fluentd’s extensible plugin framework Create a custom plugin for niche problems About the Reader For developers, architects, and operations professionals familiar with the basics of monitoring and logging. About the Author Phil Wilkins has spent over 30 years in the software industry. Has worked for small startups through to international brands. Quotes I highly recommend using Logging in Action as a getting-started guide, a refresher, or as a way to optimize your logging journey. - From the Foreword by Anurag Gupta, Fluent maintainer and Cofounder, Calyptia Covers everything you need if you want to implement a logging system using open source technology such as Fluentd and Kubernetes. - Alex Saez, Naranja X A great exploration of the features and capabilities of Fluentd, along with very useful hands-on exercises. - George Thomas, Manhattan Associates A practical holistic guide to integrating logging into your enterprise architecture. - Satej Sahu, Honeywell

Practical SQL, 2nd Edition

2022-03-01 · O'Reilly SQL Books O'Reilly Amazon

book

by Anthony DeBarros

GIS Microsoft MySQL Oracle RDBMS SQL SQL Server postgresql

Practical SQL is an approachable and fast-paced guide to SQL (Structured Query Language), the standard programming language for defining, organizing, and exploring data in relational databases. Anthony DeBarros, a journalist and data analyst, focuses on using SQL to find the story within your data. The examples and code use the open-source database PostgreSQL and its companion pgAdmin interface, and the concepts you learn will apply to most database management systems, including MySQL, Oracle, SQLite, and others.* You’ll first cover the fundamentals of databases and the SQL language, then build skills by analyzing data from real-world datasets such as US Census demographics, New York City taxi rides, and earthquakes from US Geological Survey. Each chapter includes exercises and examples that teach even those who have never programmed before all the tools necessary to build powerful databases and access information quickly and efficiently. You’ll learn how to: •Create databases and related tables using your own data •Aggregate, sort, and filter data to find patterns •Use functions for basic math and advanced statistical operations •Identify errors in data and clean them up •Analyze spatial data with a geographic information system (PostGIS) •Create advanced queries and automate tasks This updated second edition has been thoroughly revised to reflect the latest in SQL features, including additional advanced query techniques for wrangling data. This edition also has two new chapters: an expanded set of instructions on for setting up your system plus a chapter on using PostgreSQL with the popular JSON data interchange format. Learning SQL doesn’t have to be dry and complicated. Practical SQL delivers clear examples with an easy-to-follow approach to teach you the tools you need to build and manage your own databases. * Microsoft SQL Server employs a variant of the language called T-SQL, which is not covered by Practical SQL.

Snowflake Essentials: Getting Started with Big Data in the Cloud

2021-12-14 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Bjorn Lindstrom , Frank Bell , Ruchi Soni , Sameer Videkar , Bhaskar B. Joshi , Raj Chirumamilla

Analytics Big Data Cloud Computing Cyber Security Snowflake XML data data-engineering

Understand the essentials of the Snowflake Database and the overall Snowflake Data Cloud. This book covers how Snowflake’s architecture is different from prior on-premises and cloud databases. The authors also discuss, from an insider perspective, how Snowflake grew so fast to become the largest software IPO of all time. Snowflake was the first database made specifically to be optimized with a cloud architecture. This book helps you get started using Snowflake by first understanding its architecture and what separates it from other database platforms you may have used. You will learn about setting up users and accounts, and then creating database objects. You will know how to load data into Snowflake and query and analyze that data, including unstructured data such as data in XML and JSON formats. You will also learn about Snowflake’s compute platform and the different data sharing options that are available. What YouWill Learn Run analytics in the Snowflake Data Cloud Create users and roles in Snowflake Set up security in Snowflake Set up resource monitors in Snowflake Set up and optimize Snowflake Compute Load, unload, and query structured and unstructured data (JSON, XML) within Snowflake Use Snowflake Data Sharing to share data Set up a Snowflake Data Exchange Use the Snowflake Data Marketplace Who This Book Is For Database professionals or information technology professionals who want to move beyond traditional database technologies by learning Snowflake, a new and massively scalable cloud-based database solution

Cloud-Native Microservices with Apache Pulsar: Build Distributed Messaging Microservices

2021-12-04 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Rahul Sharma , Mohammad Atyab

AWS AWS Lambda Cloud Computing Java Kubernetes Cyber Security Data Streaming apache-pulsar data data-engineering

Apply different enterprise integration and processing strategies available with Pulsar, Apache's multi-tenant, high-performance, cloud-native messaging and streaming platform. This book is a comprehensive guide that examines using Pulsar Java libraries to build distributed applications with message-driven architecture. You'll begin with an introduction to Apache Pulsar architecture. The first few chapters build a foundation of message-driven architecture. Next, you'll perform a setup of all the required Pulsar components. The book also covers work with Apache Pulsar client library to build producers and consumers for the discussed patterns. You'll then explore the transformation, filter, resiliency, and tracing capabilities available with Pulsar. Moving forward, the book will discuss best practices when building message schemas and demonstrate integration patterns using microservices. Security is an important aspect of any application;the book will cover authentication and authorization in Apache Pulsar such as Transport Layer Security (TLS), OAuth 2.0, and JSON Web Token (JWT). The final chapters will cover Apache Pulsar deployment in Kubernetes. You'll build microservices and serverless components such as AWS Lambda integrated with Apache Pulsar on Kubernetes. After completing the book, you'll be able to comfortably work with the large set of out-of-the-box integration options offered by Apache Pulsar. What You'll Learn Examine the important Apache Pulsar components Build applications using Apache Pulsar client libraries Use Apache Pulsar effectively with microservices Deploy Apache Pulsar to the cloud Who This Book Is For Cloud architects and software developers who build systems in the cloud-native technologies.

PostGIS in Action, Third Edition

2021-09-12 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Leo S. Hsu , Regina Obe

GIS RDBMS SQL data data-engineering geographic-information-system-gis location-data postgis postgresql

In PostGIS in Action, Third Edition you will learn: An introduction to spatial databases Geometry, geography, raster, and topology spatial types, functions, and queries Applying PostGIS to real-world problems Extending PostGIS to web and desktop applications Querying data from external sources using PostgreSQL Foreign Data Wrappers Optimizing queries for maximum speed Simplifying geometries for greater efficiency PostGIS in Action, Third Edition teaches readers of all levels to write spatial queries for PostgreSQL. You’ll start by exploring vector-, raster-, and topology-based GIS before quickly progressing to analyzing, viewing, and mapping data. This fully updated third edition covers key changes in PostGIS 3.1 and PostgreSQL 13, including parallelization support, partitioned tables, and new JSON functions that help in creating web mapping applications. About the Technology PostGIS is a spatial database extender for PostgreSQL. It offers the features and firepower you need to take on nearly any geodata task. PostGIS lets you create location-aware queries with a few lines of SQL code, then build the backend for mapping, raster analysis, or routing application with minimal effort. About the Book PostGIS in Action, Third Edition shows you how to solve real-world geodata problems. You’ll go beyond basic mapping, and explore custom functions for your applications. Inside this fully updated edition, you’ll find coverage of new PostGIS features such as PostGIS Window functions, parallelization of queries, and outputting data for applications using JSON and Vector Tile functions. What's Inside Fully revised for PostGIS version 3.1 and PostgreSQL 13 Optimize queries for maximum speed Simplify geometries for greater efficiency Extend PostGIS to web and desktop applications About the Reader For readers familiar with relational databases and basic SQL. No prior geodata or GIS experience required. About the Authors Regina Obe and Leo Hsu are database consultants and authors. Regina is a member of the PostGIS core development team and the Project Steering Committee. Quotes The best introduction I’ve seen for engineers who want to get ramped up quickly and build advanced GIS applications. - Ikechukwu Okonkwo, Orum.io A wealth of information that showcases how powerful PostGIS is. - Luis Moux-Dominguez, EMO An extraordinary book for the world of GIS. Truly learned a lot! - DeUndre’ Rushon, DigiDiscover LLC Gives you insight into how best to provide map services for a wide audience. - Marcus Brown, Enel Green Power

Developing Modern Applications with a Converged Database

2021-08-25 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Alice LaPlante

Analytics API Blockchain Cloud Computing IoT Marketing Cyber Security Data Streaming XML data data-engineering relational-databases

Single-purpose databases were designed to address specific problems and use cases. Given this narrow focus, there are inherent tradeoffs required when trying to accommodate multiple datatypes or workloads in your enterprise environment. The result is data fragmentation that spills over into application development, IT operations, data security, system scalability, and availability. In this report, author Alice LaPlante explains why developing modern, data-driven applications may be easier and more synergistic when using a converged database. Senior developers, architects, and technical decision-makers will learn cloud-native application development techniques for working with both structured and unstructured data. You'll discover ways to run transactional and analytical workloads on a single, unified data platform. This report covers: Benefits and challenges of using a converged database to develop data-driven applications How to use one platform to work with both structured and unstructured data that includes JSON, XML, text and files, spatial and graph, Blockchain, IoT, time series, and relational data Modern development practices on a converged database, including API-driven development, containers, microservices, and event streaming Use case examples including online food delivery, real-time fraud detection, and marketing based on real-time analytics and geospatial targeting

Data Science at the Command Line, 2nd Edition

2021-08-17 · O'Reilly Data Science Books O'Reilly Amazon

book

by Jeroen Janssens

Agile/Scrum API CSV Data Science Docker HTML Linux Python Spark Unix XML data +1 more

This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 100 Unix power tools--useful whether you work with Windows, macOS, or Linux. You'll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you're comfortable processing data with Python or R, you'll learn how to greatly improve your data science workflow by leveraging the command line's power. This book is ideal for data scientists, analysts, engineers, system administrators, and researchers. Obtain data from websites, APIs, databases, and spreadsheets Perform scrub operations on text, CSV, HTML, XML, and JSON files Explore data, compute descriptive statistics, and create visualizations Manage your data science workflow Create your own tools from one-liners and existing Python or R code Parallelize and distribute data-intensive pipelines Model data with dimensionality reduction, regression, and classification algorithms Leverage the command line from Python, Jupyter, R, RStudio, and Apache Spark

Customizing Xcom to enhance data sharing between tasks

2021-07-01 · Airflow Summit 2021

session

by Vikram Koka (Astronomer) , Ephraim Anierobi

Airflow API Cloud Computing Cloud Storage Pandas S3

In Apache Airflow, Xcom is the default mechanism for passing data between tasks in a DAG. In practice, this has been restricted to small data elements, since the Xcom data is persisted in the Airflow metadatabase and is constrained by database and performance limitations. With the new TaskFlow API introduced in Airflow 2.0, it is seamless to pass data between tasks and the use of Xcom is invisible. However, the ability to pass data is restricted to a relatively small set of data types which can be natively converted in JSON. This tutorial describes how to go beyond these limitations by developing and deploying a Custom Xcom backend within Airflow to enable the sharing of large and varied data elements such as Pandas data frames between tasks in a data pipeline, using a cloud storage such as Google Storage or Amazon S3.

Pro Power BI Theme Creation: JSON Stylesheets for Automated Dashboard Formatting

2021-05-24 · O'Reilly Data Science Books O'Reilly Amazon

book

by Adam Aspin

BI Dashboard JavaScript Marketing Power BI business-intelligence data data-science microsoft-power-platform power-bi

Use JSON theme files to standardize the look of Power BI dashboards and reports. This book shows how you can create theme files using the Power BI Desktop application to define high-level formatting attributes for dashboards as well as how to tailor detailed formatting specifications for individual dashboard elements in JSON files. Standardize the look of your dashboards and apply formatting consistently over all your reports. The techniques in this book provide you with tight control over the presentation of all aspects of the Power BI dashboards and reports that you create. Power BI theme files use JSON (JavaScript Object Notation) as their structure, so the book includes a brief introduction to JSON as well as how it applies to Power BI themes. The book further includes a complete reference to all the current formatting definitions and JSON structures that are at your disposal for creating JSON theme files. Finally, the book includes dozens of theme files, from the simple to the most complex, that you can adopt and adapt to suit your own requirements. What You Will Learn Produce designer output without manually formatting every individual visual in a Power BI dashboard Standardize presentation for families of dashboard types Switch presentation styles in a couple of clicks Save dozens, or hundreds, of hours laboriously formatting dashboards Define enterprise-wide presentation standards Retroactively apply standard styles to existing dashboards Who This Book Is For Power BI users who want to save time by defining standardized formatting for their dashboards and reports, IT professionals who want to create corporate standards of dashboard presentation, and marketing and communication specialists who want to set organizational standards for dashboard delivery

Cleaning Data for Effective Data Science

2021-03-31 · O'Reilly Data Science Books O'Reilly Amazon

book

by David Mertz

AI/ML Data Quality Data Science Pandas Python SciPy SQL data data-science

Dive into the intricacies of data cleaning, a crucial aspect of any data science and machine learning pipeline, with 'Cleaning Data for Effective Data Science.' This comprehensive guide walks you through tools and methodologies like Python, R, and command-line utilities to prepare raw data for analysis. Learn practical strategies to manage, clean, and refine data encountered in the real world. What this Book will help me do Understand and utilize various data formats such as JSON, SQL, and PDF for data ingestion and processing. Master key tools like pandas, SciPy, and Tidyverse to manipulate and analyze datasets efficiently. Develop heuristics and methodologies for assessing data quality, detecting bias, and identifying irregularities. Apply advanced techniques like feature engineering and statistical adjustments to enhance data usability. Gain confidence in handling time series data by employing methods for de-trending and interpolating missing values. Author(s) David Mertz has years of experience as a Python programmer and data scientist. Known for his engaging and accessible teaching style, David has authored numerous technical articles and books. He emphasizes not only the technicalities of data science tools but also the critical thinking that approaches solutions creatively and effectively. Who is it for? 'Cleaning Data for Effective Data Science' is designed for data scientists, software developers, and educators dealing with data preparation. Whether you're an aspiring data enthusiast or an experienced professional looking to refine your skills, this book provides essential tools and frameworks. Prior programming knowledge, particularly in Python or R, coupled with an understanding of statistical fundamentals, will help you make the most of this resource.

Learn FileMaker Pro 19: The Comprehensive Guide to Building Custom Databases

2021-02-24 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Mark Conway Munro

SQL data data-engineering filemaker

Discover how easy it is to create multi-user, cross-platform custom solutions with FileMaker Pro, the relational database platform published by Apple subsidiary Claris International, Inc. Meticulously rewritten with clearer lessons, more real-world examples and updated to include feature changes introduced in recent versions, this book makes it easier to get started planning, building and deploying a custom database solution. The material is presented in an easy to follow manner with each chapter building on the last. After an initial review of the user environment and application basics, it begins a deep exploration of the integrated development environment that seamlessly combines the full stack of data table schema, business logic and interface layers into one visual programming experience. This book includes everything needed to get started building custom databases and contains advanced material that seasoned professionals will appreciate. Written bya professional developer with decades of real-world experience, Learn FileMaker Pro 19 is your comprehensive learning and reference guide. Join millions of users and developers worldwide in achieving a new level of workflow efficiency with FileMaker Pro. What You’ll Learn Discover interface and feature changes in FileMaker 17-19 Create and maintain healthy files Plan and create custom tables, fields, relationships Write calculations using built-in and custom functions Build recursive and repeating formulas Discover advanced features using cURL, JSON, SQL, ODBC and FM URL Manipulate data files in the computer directory with scripts Deploy solutions to a server and share with desktop, iOS and web clients Who This Book Is For Casual programmers, full time consultants, and IT professionals

Self Service Open Source Data Integration With AirByte

2021-02-23 · Data Engineering Podcast Listen

podcast_episode

by Michel Tricot (Airbyte) , John Lafleur (Airbyte) , Tobias Macey

Airbyte Airflow Avro BI BigQuery CI/CD Cloud Computing Dagster Data Engineering Data Management Data Quality Datacoral +21 more

Summary Data integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed platforms available, but the list of options for an open source system that supports a large variety of sources and destinations is still embarrasingly short. The team at Airbyte is adding a new entry to that list with the goal of making robust and easy to use data integration more accessible to teams who want or need to maintain full control of their data. In this episode co-founders John Lafleur and Michel Tricot share the story of how and why they created Airbyte, discuss the project’s design and architecture, and explain their vision of what an open soure data integration platform should offer. If you are struggling to maintain your extract and load pipelines or spending time on integrating with a new system when you would prefer to be working on other projects then this is definitely a conversation worth listening to.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask. RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. Your host is Tobias Macey and today I’m interviewing Michel Tricot and John Lafleur about Airbyte, an open source framework for building data integration pipelines.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Airbyte is and the story behind it? Businesses and data engineers have a variety of options for how to manage their data integration. How would you characterize the overall landscape and how does Airbyte distinguish itself in that space? How would you characterize your target users?

How have those personas instructed the priorities and design of Airbyte? What do you see as the benefits and tradeoffs of a UI oriented data integration platform as compared to a code first approach?

what are the complex/challenging elements of data integration that makes it such a slippery problem? motivation for creating open source ELT as a business Can you describe how the Airbyte platform is implemented?

What was your motivation for choosing Java as the primary language?

incidental complexity of forcing all connectors to be packaged as containers shortcomings of the Singer specification/motivation for creating a backwards incompatible interface perceived potential for community adoption of Airbyte specification tradeoffs of using JSON as interchange format vs. e.g. protobuf/gRPC/Avro/etc.

information lost when converting records to JSON types/how to preserve that information (e.g. field constraints, valid enums, etc.)

interfaces/extension points for integrating with other tools, e.g. Dagster abstraction layers for simplifying implementation of new connectors tradeoffs of storing all connectors in a monorepo with the Airbyte core

impact of community adoption/contributions

What is involved in setting up an Airbyte installation? What are the available axes for scaling an Airbyte deployment? challenges of setting up and maintaining CI environment for Airbyte How are you managing governance and long term sustainability of the project? What are some of the most interesting, unexpected, or innovative ways that you have seen Airbyte used? What are the most interesting, unexpected, or challenging lessons that you have learned while building Airbyte? When is Airbyte the wrong choice? What do you have planned for the future of the project?

Contact Info

Michel

LinkedIn @MichelTricot on Twitter michel-tricot on GitHub

John

LinkedIn @JeanLafleur on Twitter johnlafleur on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Airbyte Liveramp Fivetran

Podcast Episode

Stitch Data Matillion DataCoral

Podcast Episode

Singer Meltano

Podcast Episode

Airflow

Podcast.init Episode

Kotlin Docker Monorepo Airbyte Specification Great Expectations

Podcast Episode

Dagster

Data Engineering Podcast Episode Podcast.init Episode

Prefect

Podcast Episode

DBT

Podcast Episode

Kubernetes Snowflake

Podcast Episode

Redshift Presto Spark Parquet

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

talk-data.com

Activity Trend

Top Events

Top Speakers

Full Stack FastAPI, React, and MongoDB

SQL for Data Analytics - Third Edition

MySQL Cookbook, 4th Edition

Powering Up the Business with a Lakehouse

Opening the Floodgates: Enabling Fast, Unmediated End User Access to Trillion-Row Datasets with SQL

UIMeta: A 10X Faster Cloud-Native Apache Spark History Server

What's new in Airflow 2.3?

Discover And De-Clutter Your Unstructured Data With Aparavi

Logging in Action

Practical SQL, 2nd Edition

Snowflake Essentials: Getting Started with Big Data in the Cloud

Cloud-Native Microservices with Apache Pulsar: Build Distributed Messaging Microservices

PostGIS in Action, Third Edition

Developing Modern Applications with a Converged Database

Data Science at the Command Line, 2nd Edition

Customizing Xcom to enhance data sharing between tasks

Pro Power BI Theme Creation: JSON Stylesheets for Automated Dashboard Formatting

Cleaning Data for Effective Data Science

Learn FileMaker Pro 19: The Comprehensive Guide to Building Custom Databases

Self Service Open Source Data Integration With AirByte