talk-data.com

Topic

SQL

Structured Query Language (SQL)

database_language data_manipulation data_definition programming_language

1751 tagged

Activity Trend: peak of 107 activities per quarter (2020-Q1 to 2026-Q1)

Activities

1751 activities · Newest first

Learn PostgreSQL

Dive into the world of PostgreSQL, one of the most powerful and versatile open-source relational databases! This book guides you through all the essentials of PostgreSQL versions 12 and 13, from installation to high-performance database deployments. You'll learn how to design schemas, perform database operations efficiently, and implement advanced functionality.

What this Book will help me do
Install, configure, and monitor a PostgreSQL server for optimal performance.
Implement SQL and PL/pgSQL scripts to build complex database solutions.
Analyze and optimize database schemas and indexes for efficiency.
Secure a PostgreSQL database and manage roles and permissions effectively.
Set up high-availability configurations through replication techniques.

Author(s)
Luca Ferrari and Enrico Pirozzi are seasoned database professionals with extensive experience in PostgreSQL. They bring practical expertise and a real-world perspective to the subject, ensuring you get hands-on knowledge and apply it effectively. Their approachable writing style simplifies even the most complex database concepts.

Who is it for?
This book is perfect for database professionals, developers, and tech enthusiasts looking to gain mastery over PostgreSQL. Whether you are new to PostgreSQL or have a fundamental understanding of databases, you'll find this book highly insightful in achieving your database management goals.
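
As a taste of the PL/pgSQL material the book covers, here is a minimal sketch of defining and calling a stored function from Python with psycopg2. The connection details and the table and function names are hypothetical, not taken from the book.

```python
# A minimal sketch: create a PL/pgSQL function and call it with a bound
# parameter. Assumes a reachable PostgreSQL instance; all names are made up.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="appdb", user="app", password="secret")
cur = conn.cursor()

# Define a PL/pgSQL function that counts rows for a given status.
cur.execute("""
CREATE OR REPLACE FUNCTION count_orders(p_status text)
RETURNS bigint
LANGUAGE plpgsql AS $$
BEGIN
    RETURN (SELECT count(*) FROM orders WHERE status = p_status);
END;
$$;
""")
conn.commit()

# Call it like any other SQL function, passing the argument as data.
cur.execute("SELECT count_orders(%s);", ("shipped",))
print(cur.fetchone()[0])

cur.close()
conn.close()
```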

ETL with Azure Cookbook

ETL with Azure Cookbook is a comprehensive guide to building effective and scalable ETL solutions using the Azure cloud platform. Through hands-on recipes, this book explores the features and capabilities of Azure services for data integration and transformation, guiding you in creating efficient processes for moving and handling data.

What this Book will help me do
Master the basics and advanced techniques for building ETL processes on Azure.
Learn practical skills in designing solutions that integrate multiple Azure services.
Understand how to migrate existing on-premises ETL solutions to Azure successfully.
Acquire knowledge of SQL Server and Azure Big Data Clusters for data integration.
Gain experience in automating and optimizing data processes with Biml and Azure Databricks.

Author(s)
The authors of ETL with Azure Cookbook are experienced data engineers and Azure specialists with years of expertise in designing and implementing robust data solutions. Their professional journey includes hands-on work with SQL Server, Azure services, and scalable ETL frameworks. They aim to provide practical insights and actionable guidance to help readers achieve success in data engineering projects.

Who is it for?
This book is ideal for data architects, ETL developers, and IT professionals seeking to enhance their skills in data integration and transformation, particularly within the Azure ecosystem. It's suitable for individuals with some knowledge of data engineering principles, SQL, and familiarity with ETL processes who aim to adopt modern cloud-based approaches.

Metabase Up and Running

Metabase Up and Running is your go-to guide for mastering Metabase, the open-source business intelligence tool. You'll progress from the basics of installation and setup to connecting data sources and creating insightful visualizations and dashboards. By the end, you'll be confident in implementing Metabase in your organization for impactful decision-making.

What this Book will help me do
Understand how to securely deploy and configure Metabase on Amazon Web Services.
Master the creation of dashboards, reports, and visualizations using Metabase's tools.
Gain expertise in user and permissions management within Metabase.
Learn to use Metabase's SQL console for advanced database interactions.
Acquire skills to embed Metabase within applications and automate reports via email or Slack.

Author(s)
Tim Abraham, an experienced tool specialist, is passionate about teaching others how to leverage data tools effectively. With a background in business analytics, Abraham has guided companies of all sizes. His approachable writing style ensures a learning journey that is both informative and engaging.

Who is it for?
This book is ideal for business analysts and data professionals looking to amplify their business intelligence capabilities using Metabase. Readers should have some understanding of data analytics principles. Whether you're starting in analytics or seeking advanced automation, this book offers valuable guidance to meet your goals.

Understanding Oracle APEX 20 Application Development: Think Like an Application Express Developer

This book shows developers and Oracle professionals how to build practical, non-trivial web applications using Oracle's rapid application development environment, Application Express (APEX). This third edition is revised to cover the new features and user interface experience found in APEX 20. Interactive grids and form regions are two of the newer aspects of APEX covered in this edition. The book is targeted at those who are new to APEX and just beginning to develop real projects for deployment, as well as those who are familiar with APEX and want a deeper understanding. The book takes you through the development of a demo web application that illustrates the concepts all APEX programmers should know. This book introduces the world of APEX properties, explaining the functionality supported by each page component as well as the techniques developers use to achieve that functionality. Topics include conditional formatting, user-customized reports, data entry forms, concurrency and lost updates, and security control. Specific attention is given in the book to the thought process involved in choosing and assembling APEX components and features to deliver a specific result. Understanding Oracle APEX 20 Application Development, 3rd Edition is the ideal book to take you from an understanding of the individual pieces of APEX to an understanding of how those pieces are assembled into polished applications.

What You Will Learn
Build attractive, highly functional web apps from the ground up
Enhance and customize pages created by the APEX wizards
Understand the security implications of page design
Write PL/SQL code for process activity and verification
Build complex components such as forms and interactive grids

Who This Book Is For
Developers new to APEX who desire a strong fundamental understanding of how APEX applications work, and existing developers and database administrators desiring to mine the most value from APEX by improving their development techniques.

SQL Server 2019 Administrator's Guide - Second Edition

SQL Server 2019 Administrator's Guide provides a complete walkthrough of administering, managing, and optimizing SQL Server 2019. You'll gain the expertise needed to implement secure and efficient database solutions suitable for enterprise-scale environments. This book systematically explores the tools, techniques, and best practices essential to mastering SQL Server 2019.

What this Book will help me do
Optimize database queries and design using indexing techniques to resolve performance issues effectively.
Implement robust backup and recovery mechanisms following advanced security policies.
Utilize SQL Server 2019 tools for automation in monitoring, maintaining, and managing health checks.
Integrate SQL Server with Azure for big data processing and scalability.
Set up highly available and stable Always On environments for enterprise databases.

Author(s)
Marek Chmel and Vladimír Mužný are seasoned database administrators with years of hands-on experience in SQL Server and database infrastructure. Their collaborative writing approach emphasizes real-world scenarios and examples that make technical concepts accessible. With accolades in professional database education and a passion for teaching, they provide a guiding hand through complex database subjects.

Who is it for?
This book is ideal for database administrators, developers, and IT professionals who seek to enhance their expertise with SQL Server 2019. Readers should have a basic understanding of database principles and familiarity with prior versions of SQL Server. Whether you're stepping into advanced administration or seeking to fine-tune your enterprise database infrastructure, this book is tailored for you.

Summary
Databases are limited in scope to the information that they directly contain. For analytical use cases you often want to combine data across multiple sources and storage locations, which frequently requires cumbersome and time-consuming data integration. To address this problem, Martin Traverso and his colleagues at Facebook built the Presto distributed query engine. In this episode he explains how it is designed to allow for querying and combining data where it resides, the use cases that such an architecture unlocks, and the innovative ways that it is being employed at companies across the world. If you need to work with data in your cloud data lake, your on-premises database, or a collection of flat files, then give this episode a listen and then try out Presto today.
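
To make the "query it where it lives" idea concrete, here is a hedged sketch using the open source trino Python client (the community's client for Presto-lineage engines). The host, catalogs, and table names are hypothetical; the point is that one SQL statement can join a data-lake table with an operational database without moving either dataset first.

```python
# Hypothetical federation example: join files in a lake with database rows.
import trino

conn = trino.dbapi.connect(host="presto.example.com", port=8080, user="analyst")
cur = conn.cursor()
cur.execute("""
    SELECT u.country, count(*) AS events
    FROM hive.web.page_views pv          -- files in the data lake
    JOIN postgresql.app.users u          -- rows in an operational database
      ON pv.user_id = u.id
    GROUP BY u.country
    ORDER BY events DESC
""")
for row in cur.fetchall():
    print(row)
```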

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show!

You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!

Your host is Tobias Macey and today I'm interviewing Martin Traverso about PrestoSQL, a distributed SQL engine that queries data in place.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what Presto is and its origin story?

What was the motivation for releasing Presto as open source?

For someone who is responsible for architecting their organization’s data platform, what are some of the signals that Presto will be a good fit for them?

What are the primary ways that Presto is being used?

I interviewed your colleague at Starburst, Kamil, 2 years ago. How has Presto changed or evolved in that time, both technically and in terms of community and ecosystem growth?
What are some of the deployment and scaling considerations that operators of Presto should be aware of?
What are the best practices that have been established for working with data through Presto in terms of centralizing in a data lake vs. federating across disparate storage locations?
What are the tradeoffs of using Presto on top of a data lake vs. a vertically integrated warehouse solution?
When designing the layout of a data lake that will be interacted with via Presto, what are some of the data modeling considerations that can improve the odds of success?
What are some of the most interesting, unexpected, or innovative ways that you have seen Presto used?
What are the most interesting, unexpected, or challenging lessons that you have learned?

Microservices in SAP HANA XSA: A Guide to REST APIs Using Node.js

Build enterprise-grade microservices in the SAP HANA Advanced Model (XSA). This book explains building scalable APIs in XSA and the benefits of building microservices with SAP HANA XSA. This book covers the Cloud Foundry (CF) architecture and how SAP HANA XSA follows the model. It begins with the details of the different architectural layers of applications hosted in XSA (specifically, microservices). Everything you need to know is presented, including analyzing requests, modularization, database ingestion, building JSON responses, and scaling your microservices. You will learn to use development tools such as the SAP Web IDE, Postman, and the SAP HANA Cockpit for XSA, including debugging examples on SAP HANA XSA with code snippets showing how microservices can be developed, debugged, scaled, and deployed on SAP HANA XSA. Microservices are divided into security and authentication, request handling, modularization of Node.js, interaction with the SAP HANA database containers, and response formatting. An end-to-end scenario is presented of a Node.js REST API that uses HTTP methods, concluding with deploying an SAP HANA XSA project to a production environment. This book is simple enough to help you implement a Node.js module in order to understand the development of microservices, and complex enough for architects to design their next business-ready solution integrating UAA security, application modularization, and an end-to-end REST API on SAP HANA XSA.

What You Will Learn
Know the definition and architecture of Cloud Foundry and its application on SAP HANA XSA
Understand REST principles and different HTTP methods
Explore microservices (Node.js) development
Interact with the database from Node (executing SQL statements and stored procedures)

Who This Book Is For
Architects designing business-ready solutions that integrate UAA security, application modularization, and an end-to-end REST API on SAP HANA XSA
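
The book's examples are written in Node.js on the XSA runtime. Purely as a language-neutral sketch of the request-handling pattern it describes (an HTTP method mapped to a parameterized SQL statement returning JSON), here is the shape of such an endpoint in Python, with Flask and sqlite3 standing in for the XSA runtime and the HANA driver. All names are hypothetical.

```python
# Illustrative only: a GET endpoint that runs a parameterized query and
# returns a JSON response, mirroring the REST pattern described above.
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/products/<int:product_id>", methods=["GET"])
def get_product(product_id):
    conn = sqlite3.connect("demo.db")
    row = conn.execute(
        "SELECT id, name, price FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    conn.close()
    if row is None:
        return jsonify({"error": "not found"}), 404
    return jsonify({"id": row[0], "name": row[1], "price": row[2]})

if __name__ == "__main__":
    app.run()
```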

The Data Wrangling Workshop - Second Edition

The Data Wrangling Workshop is your beginner's guide to the essential techniques and practices of data manipulation using Python. Throughout the book, you will progressively build your skills, learning key concepts such as extracting, cleaning, and transforming data into actionable insights. By the end, you'll be confident in handling various data wrangling tasks efficiently.

What this Book will help me do
Understand and apply the fundamentals of data wrangling using Python.
Combine and aggregate data from diverse sources like web data, SQL databases, and spreadsheets.
Use descriptive statistics and plotting to examine dataset properties.
Handle missing or incorrect data effectively to maintain data quality.
Gain hands-on experience with Python's powerful data science libraries like pandas, NumPy, and Matplotlib.

Author(s)
Brian Lipp, Shubhadeep Roychowdhury, and Dr. Tirthajyoti Sarkar are experienced educators and professionals in the fields of data science and engineering. Their collective expertise spans years of teaching and working with data technologies. They aim to make data wrangling accessible and comprehensible, focusing on practical examples to equip learners with real-world skills.

Who is it for?
The Data Wrangling Workshop is ideal for developers, data analysts, and business analysts aiming to become data scientists or analytics experts. If you're just getting started with Python, you will find this book guiding you step-by-step. A basic understanding of Python programming, as well as relational databases and SQL, is recommended for smooth learning.
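
A small sketch of the wrangling workflow the book covers: pull rows from a SQL database, combine them with a spreadsheet-style CSV, and repair missing values with pandas. The file, table, and column names are hypothetical.

```python
# Combine a SQL source with a CSV export, then clean before analysis.
import sqlite3
import pandas as pd

conn = sqlite3.connect("sales.db")
orders = pd.read_sql("SELECT order_id, customer_id, amount FROM orders", conn)
customers = pd.read_csv("customers.csv")   # e.g. exported from a spreadsheet

df = orders.merge(customers, on="customer_id", how="left")

# Handle missing or incorrect data to maintain data quality.
df["amount"] = df["amount"].fillna(0)
df = df[df["amount"] >= 0]                 # drop obviously bad rows

print(df.describe())                       # quick descriptive statistics
conn.close()
```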

Learning Spark, 2nd Edition

Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark. Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:

Learn Python, SQL, Scala, or Java high-level Structured APIs
Understand Spark operations and the SQL engine
Inspect, tune, and debug Spark operations with Spark configurations and the Spark UI
Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
Perform analytics on batch and streaming data using Structured Streaming
Build reliable data pipelines with open source Delta Lake and Spark
Develop machine learning pipelines with MLlib and productionize models using MLflow
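
A minimal PySpark sketch of the Structured APIs the book teaches: load a file-based source, then express the same aggregation through both the DataFrame API and the SQL engine. The Parquet path and column names are hypothetical.

```python
# Same logic twice: once via the DataFrame API, once via Spark SQL.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("learning-spark-demo").getOrCreate()

flights = spark.read.parquet("s3://bucket/flights/")   # or JSON, CSV, Avro, ORC...

# DataFrame API
(flights.groupBy("origin")
        .agg(F.avg("delay").alias("avg_delay"))
        .orderBy(F.desc("avg_delay"))
        .show(5))

# Equivalent Spark SQL
flights.createOrReplaceTempView("flights")
spark.sql("""
    SELECT origin, avg(delay) AS avg_delay
    FROM flights
    GROUP BY origin
    ORDER BY avg_delay DESC
    LIMIT 5
""").show()
```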

SQL Injection Strategies

SQL Injection Strategies is the go-to guide for understanding and mastering the concepts and practical aspects of SQL injection. You will comprehensively learn about the processes to identify vulnerabilities in web applications and databases, how to safely test for SQL injection, and strategies to defend against such attacks. The book balances theory and practice effectively, offering tools and techniques for both learning and application.

What this Book will help me do
Gain a firm understanding of what SQL injection is and how it affects web and mobile applications.
Learn to set up a safe and effective environment for practicing SQL injection techniques.
Discover manual and tool-assisted methods for testing and performing SQL injection.
Understand defense measures to mitigate and defend against SQL injection vulnerabilities.
Be able to apply SQL injection knowledge to secure various systems including web, mobile, and IoT platforms.

Author(s)
Ettore Galluccio, Gabriele Lombari, and their co-authors are seasoned professionals with extensive experience in cybersecurity and web application development. Their expertise in identifying system vulnerabilities and devising comprehensive defense mechanisms is well-recognized. This book reflects their commitment to teaching practical security techniques needed in today's technology-driven world.

Who is it for?
This book is designed for penetration testers, cybersecurity enthusiasts, ethical hackers, and technology practitioners seeking to understand SQL injection. Beginners with no prior experience in SQL injection as well as intermediate-level users looking to deepen their knowledge will find value. It's ideal for anyone looking for practical, hands-on guidance in securing applications and learning about common vulnerabilities.
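
The core defense the book drills is never splicing user input into SQL text. Here is a self-contained sketch with sqlite3; the table is hypothetical but the vulnerable-versus-parameterized contrast is general.

```python
# Demonstrates why bound parameters defeat a classic injection payload.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

user_input = "' OR '1'='1"   # classic injection payload

# VULNERABLE: string formatting lets the payload rewrite the WHERE clause.
vulnerable = f"SELECT * FROM users WHERE name = '{user_input}'"
print(conn.execute(vulnerable).fetchall())   # returns every row

# SAFE: a bound parameter is treated as data, never as SQL.
safe = "SELECT * FROM users WHERE name = ?"
print(conn.execute(safe, (user_input,)).fetchall())   # returns nothing
```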

Summary
There are an increasing number of use cases for real-time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running you need a way to keep an eye on it, including observability, discovery, and governance of your data. That's what the Lenses.io DataOps platform is built for. In this episode CTO Andrew Stevenson discusses the challenges that arise from building decoupled systems, the benefits of using SQL as the common interface for your data, and the metrics that need to be tracked to keep the overall system healthy. Observability and governance of streaming data require a different approach than batch-oriented workflows, and this episode does an excellent job of outlining the complexities involved and how to address them.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial. If you start a trial and install Datadog's agent, Datadog will send you a free T-shirt.

Your host is Tobias Macey and today I'm interviewing Andrew Stevenson about Lenses.io, a platform to provide real-time data operations for engineers.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what Lenses is and the story behind it?
What is your working definition for what constitutes DataOps?

How does the Lenses platform support the cross-cutting concerns that arise when trying to bridge the different roles in an organization to deliver value with data?

What are the typical barriers to collaboration, and how does Lenses help with that?

Many different systems provide a SQL interface to streaming data on various substrates. What was your reason for building your own SQL engine, and what is unique about it?
What are the main challenges that you see engineers facing when working with streaming data?

In search of a better, modern, simpler method of managing ETL processes and merging them with various AI and ML tasks, we landed on Airflow. We envisioned a new user-friendly interface that can leverage dynamic DAGs and reusable components to build an ETL tool that requires virtually no training. We built several template DAGs and connectors for Airflow to typical data sources, like SQL Server, then proceeded to build a modern interface on top that brings ETL build, scheduling, and execution capabilities. Acknowledging that Airflow is designed for task orchestration, we expanded our infrastructure to use Kubernetes and Docker for elastic computing. Key to our solution is the ability to create ETLs using only open source tools, while executing on par with or faster than commercial solutions, with an interface so simple that ETLs can be created in seconds.
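
A hedged sketch of the dynamic-DAG idea described above: one template, many DAGs generated from a config list. The source names and the callable are hypothetical; the generation pattern itself is standard Airflow (2.x import paths assumed).

```python
# Generate one DAG per configured source from a single template.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

SOURCES = ["sql_server_orders", "sql_server_customers"]  # reusable "connectors"

def extract(source, **context):
    print(f"extracting from {source}")   # real logic would pull and load data

for source in SOURCES:
    with DAG(
        dag_id=f"etl_{source}",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="extract",
            python_callable=extract,
            op_kwargs={"source": source},
        )
    globals()[f"etl_{source}"] = dag   # register each generated DAG
```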

To improve automation of data pipelines, I propose a universal approach to the ELT pipeline that optimizes for data integrity, extensibility, and speed to delivery. The workflow is built using open source tools and standards like Apache Airflow, Singer, Great Expectations, and dbt. Templating ETLs is challenging! The creation and maintenance of data pipelines in production require hard work to manage bugs in code and bad data. I would like to propose a data pipeline pattern that can simplify building pipelines while optimizing for data integrity and observability.

Goals:
Make ELT simple and fast to implement
Validate your assumptions about the data before you make it available for use
Allow analysts and data scientists to make pain-free contributions to ELT using SQL
Generate data documentation and failure logs for quick recovery, and fix outages in your pipeline

Target Audience:
Approachable to any level of developer
Novice data professionals interested in starting an ELT workflow and learning about the different tools of the ecosystem
Intermediate+ developers interested in supercharging their pipeline with the Write-Audit-Publish pattern and reducing pipeline debt
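A minimal, library-free sketch of the Write-Audit-Publish pattern the talk proposes (the real workflow uses Singer, Great Expectations, and dbt; sqlite3 stands in here). The table names and validation rules are hypothetical.

```python
# Write into staging, audit the batch, and publish only if the audit passes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO staging_orders VALUES (?, ?)",
                 [(1, 19.99), (2, 5.00)])

# WRITE happened into staging; now AUDIT before anything is exposed.
bad = conn.execute("""
    SELECT count(*) FROM staging_orders
    WHERE id IS NULL OR amount < 0
""").fetchone()[0]

if bad:
    raise ValueError(f"audit failed: {bad} bad rows; pipeline halted")

# PUBLISH: only validated data reaches the production table.
conn.execute("INSERT INTO orders SELECT * FROM staging_orders")
conn.commit()
print(conn.execute("SELECT count(*) FROM orders").fetchone()[0], "rows published")
```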

Beginning Apache Spark Using Azure Databricks: Unleashing Large Cluster Analytics in the Cloud

Analyze vast amounts of data in record time using Apache Spark with Databricks in the cloud. Learn the fundamentals, and more, of running analytics on large clusters in Azure and AWS, using Apache Spark with Databricks on top. Discover how to squeeze the most value out of your data at a mere fraction of what classical analytics solutions cost, while at the same time getting the results you need, incrementally faster. This book explains how the confluence of these pivotal technologies gives you enormous power, and cheaply, when it comes to huge datasets. You will begin by learning how cloud infrastructure makes it possible to scale your code to large amounts of processing units, without having to pay for the machinery in advance. From there you will learn how Apache Spark, an open source framework, can enable all those CPUs for data analytics use. Finally, you will see how services such as Databricks provide the power of Apache Spark, without you having to know anything about configuring hardware or software. By removing the need for expensive experts and hardware, your resources can instead be allocated to actually finding business value in the data. This book guides you through some advanced topics such as analytics in the cloud, data lakes, data ingestion, architecture, machine learning, and tools, including Apache Spark, Apache Hadoop, Apache Hive, Python, and SQL. Valuable exercises help reinforce what you have learned.

What You Will Learn
Discover the value of big data analytics that leverage the power of the cloud
Get started with Databricks using SQL and Python in either Microsoft Azure or AWS
Understand the underlying technology, and how the cloud and Apache Spark fit into the bigger picture
See how these tools are used in the real world
Run basic analytics, including machine learning, on billions of rows at a fraction of the cost, or free

Who This Book Is For
Data engineers, data scientists, and cloud architects who want or need to run advanced analytics in the cloud. It is assumed that the reader has data experience, but perhaps minimal exposure to Apache Spark and Azure Databricks. The book is also recommended for people who want to get started in the analytics field, as it provides a strong foundation.
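
A hedged sketch of the kind of cluster-scale machine learning the book runs on Databricks: a small MLlib pipeline over a Spark DataFrame. The inline data and column names are hypothetical; on a Databricks cluster the same code runs against tables with billions of rows.

```python
# Assemble features and fit a classifier with Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("databricks-ml-sketch").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 1.2, 0.5), (1.0, 3.4, 1.9), (0.0, 0.8, 0.3), (1.0, 2.9, 2.2)],
    ["label", "feature_a", "feature_b"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])

model = pipeline.fit(df)
model.transform(df).select("label", "prediction").show()
```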

Spark in Action, Second Edition

The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you'll learn to take advantage of Spark's core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark's powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop.

About the Technology
Analyzing enterprise data starts by reading, filtering, and merging files and streams from many sources. The Spark data processing engine handles this varied volume like a champ, delivering speeds 100 times faster than Hadoop systems. Thanks to SQL support, an intuitive interface, and a straightforward multilanguage API, you can use Spark without learning a complex new ecosystem.

About the Book
Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you'll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you'll discover Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes that give you a cheat sheet for installing tools and understanding Spark-specific terms.

What's Inside
Writing Spark applications in Java
Spark application architecture
Ingestion through files, databases, streaming, and Elasticsearch
Querying distributed datasets with Spark SQL

About the Reader
This book does not assume previous experience with Spark, Scala, or Hadoop.

About the Author
Jean-Georges Perrin is an experienced data and software architect. He is France's first IBM Champion and has been honored for 12 consecutive years.

Quotes
"This book reveals the tools and secrets you need to drive innovation in your company or community." - Rob Thomas, IBM
"An indispensable, well-paced, and in-depth guide. A must-have for anyone into big data and real-time stream processing." - Anupam Sengupta, GuardHat Inc.
"This book will help spark a love affair with distributed processing." - Conor Redmond, InComm Product Control
"Currently the best book on the subject!" - Markus Breuer, Materna IPS

SQL Server on Azure Virtual Machines

Would you like to master deploying SQL Server in the cloud using Microsoft's Azure platform? With the hands-on guidance in this book, you'll explore how to set up and configure SQL Server on Azure Virtual Machines effectively. By the end, you'll have the knowledge to optimize, manage, and deploy your solutions.

What this Book will help me do
Understand platform availability for SQL Server in Azure
Explore SQL Server IaaS and optimize its configuration
Master deploying SQL Server on Linux and Windows in Azure
Configure high-performance storage options tailored to SQL Server
Learn disaster recovery strategies for SQL Server in Azure

Author(s)
Joey D'Antoni, Louis Davidson, Allan Hirt, and their co-authors bring years of experience in database management, cloud architecture, and technical writing. They aim to provide clear and actionable advice for working efficiently with SQL Server on Azure. Their insights come from real-world projects.

Who is it for?
This book is for developers, database administrators, and cloud architects who are looking to learn how to deploy SQL Server solutions on Azure Virtual Machines. If you are transitioning workloads to the cloud or need to manage or optimize such environments, this book will equip you with the skills you need. Basic SQL Server knowledge is helpful.

Summary
Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert. In order to bring the DBA into the new era of data management the team at Upsolver added a SQL interface to their data lake platform. In this episode Upsolver CEO Ori Rafael and CTO Yoni Iny describe how they have grown their platform deliberately to allow for layering SQL on top of a robust foundation for creating and operating a data lake, how to bring more people on board to work with the data being collected, and the unique benefits that a data lake provides. This was an interesting look at the impact that the interface to your data can have on who is empowered to work with it.
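
Upsolver's engine is proprietary, so as an illustrative stand-in, the open source DuckDB shows the concept the episode discusses: plain SQL over raw files sitting in a lake, with no load step into a database first. The file path and columns are hypothetical.

```python
# SQL directly over Parquet files on disk, no ingestion step required.
import duckdb

result = duckdb.sql("""
    SELECT event_type, count(*) AS events
    FROM 'lake/events/*.parquet'
    GROUP BY event_type
    ORDER BY events DESC
""").fetchall()
print(result)
```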

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You'll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don't have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there's no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.

Your host is Tobias Macey and today I'm interviewing Ori Rafael and Yoni Iny about building a data lake for the DBA at Upsolver.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of what a data lake is and what it is comprised of?
We talked last in November of 2018. How has the landscape of data lake technologies and adoption changed in that time?

How has Upsolver changed or evolved since we last spoke?

How has the evolution of the underlying technologies impacted your implementation and overall product strategy?

What are some of the common challenges that accompany a data lake implementation?
How do those challenges influence the adoption or viability of a data lake?
How does the introduction of a universal SQL layer change the staffing requirements for building and maintaining a data lake?

What are the advantages of a data lake over a data warehouse if everything is being managed via SQL anyway?

What are some of the underlying realities of the data systems that power the lake which will eventually need to be understood by the operators of the platform?
How is the SQL layer in Upsolver implemented?

What are the most challenging or complex aspects of managing the underlying technologies to provide automated partitioning, indexing, etc.?

What are the main concepts that you need to educate your customers on?
What are some of the pitfalls that users should be aware of?
What features of your platform are often overlooked or underutilized which you think should be more widely adopted?
What have you found to be the most interesting, unexpected, or challenging lessons learned while building the technical and business elements of Upsolver?
What do you have planned for the future?

Contact Info

Ori

LinkedIn

Yoni

yoniiny on GitHub
LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Upsolver

Podcast Episode

DBA == Database Administrator
IDF == Israel Defense Forces
Data Lake
Eventual Consistency
Apache Spark
Redshift Spectrum
Azure Synapse Analytics
SnowflakeDB

Podcast Episode

BigQuery
Presto

Podcast Episode

Apache Kafka
Cartesian Product
ksqlDB

Podcast Episode

Eventador

Podcast Episode

Materialize

Podcast Episode

Common Table Expressions
Lambda Architecture
Kappa Architecture
Apache Flink

Podcast Episode

Reinforcement Learning
CloudFormation
GDPR

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Learn SQL Database Programming

Learn SQL Database Programming is your comprehensive guide to mastering SQL and its applications in database management. With step-by-step instructions, you'll gain confidence in querying and manipulating data, covering both fundamental and advanced SQL techniques. By working through this book, you'll acquire in-demand skills for organizing, analyzing, and presenting data effectively.

What this Book will help me do
Install and configure MySQL tools to create and manage databases efficiently.
Utilize SQL commands to query and retrieve data from simple or complex datasets.
Manipulate data securely using commands like INSERT, UPDATE, and DELETE.
Master advanced SQL techniques including joins, subqueries, and flow controls.
Apply best practices in SQL queries to design databases with optimal performance.

Author(s)
Josephine Bush is an experienced database developer and technical educator with a strong background in SQL programming. She has years of practical experience working with relational databases, and her teaching is grounded in real-world applications. She excels at explaining complex concepts clearly and emphasizing hands-on learning.

Who is it for?
This book is ideal for business analysts, aspiring SQL developers, database administrators, and students entering the field of SQL programming. It caters to beginners with no prior SQL experience, providing a structured and practical approach to learning. If you're eager to organize data or administer databases effectively, this book is for you.
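
The book works in MySQL; sqlite3 is used below only so the sketch is self-contained. The statements are the ones the book covers (INSERT, UPDATE, DELETE, and a join), and the table and column names are hypothetical.

```python
# Basic CRUD plus a join, with every value passed as a bound parameter.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
""")

conn.execute("INSERT INTO authors (name) VALUES (?)", ("J. Bush",))
conn.execute("INSERT INTO books (title, author_id) VALUES (?, ?)",
             ("Learn SQL Database Programming", 1))
conn.execute("UPDATE books SET title = ? WHERE id = ?", ("Learn SQL", 1))

# A join: list each book with its author.
for row in conn.execute("""
    SELECT b.title, a.name
    FROM books b JOIN authors a ON b.author_id = a.id
"""):
    print(row)

conn.execute("DELETE FROM books WHERE id = ?", (1,))
conn.commit()
```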

SQL Server Big Data Clusters: Data Virtualization, Data Lake, and AI Platform

Use this guide to one of SQL Server 2019's most impactful features: Big Data Clusters. You will learn about data virtualization and data lakes for this complete artificial intelligence (AI) and machine learning (ML) platform within the SQL Server database engine. You will know how to use Big Data Clusters to combine large volumes of streaming data for analysis along with data stored in a traditional database. For example, you can stream large volumes of data from Apache Spark in real time while executing Transact-SQL queries to bring in relevant additional data from your corporate SQL Server database. Filled with clear examples and use cases, this book provides everything necessary to get started working with Big Data Clusters in SQL Server 2019. You will learn about the architectural foundations that are made up from Kubernetes, Spark, HDFS, and SQL Server on Linux. You then are shown how to configure and deploy Big Data Clusters in on-premises environments or in the cloud. Next, you are taught about querying. You will learn to write queries in Transact-SQL, taking advantage of skills you have honed for years, and with those queries you will be able to examine and analyze data from a wide variety of sources such as Apache Spark. Through the theoretical foundation provided in this book and easy-to-follow example scripts and notebooks, you will be ready to use and unveil the full potential of SQL Server 2019: combining different types of data spread across widely disparate sources into a single view that is useful for business intelligence and machine learning analysis.

What You Will Learn
Install, manage, and troubleshoot Big Data Clusters in cloud or on-premises environments
Analyze large volumes of data directly from SQL Server and/or Apache Spark
Manage data stored in HDFS from SQL Server as if it were relational data
Implement advanced analytics solutions through machine learning and AI
Expose different data sources as a single logical source using data virtualization

Who This Book Is For
Data engineers, data scientists, data architects, and database administrators who want to employ data virtualization and big data analytics in their environments
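
A hedged sketch of the querying model the book describes: ordinary Transact-SQL, sent from Python with pyodbc, joining a local table against an external (data-virtualized) table surfacing HDFS/Spark data. The server, credentials, and both table names are hypothetical, and the external table would need to be defined in the Big Data Cluster beforehand.

```python
# One T-SQL statement spanning relational and virtualized big-data sources.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=bdc.example.com;"
    "DATABASE=sales;UID=analyst;PWD=secret"
)
cur = conn.cursor()
cur.execute("""
    SELECT c.region, sum(w.clicks) AS clicks
    FROM dbo.customers AS c                 -- relational data in SQL Server
    JOIN ext.web_clickstream AS w           -- external table over HDFS
      ON c.id = w.customer_id
    GROUP BY c.region
""")
for row in cur.fetchall():
    print(row.region, row.clicks)
```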

Introducing Microsoft SQL Server 2019

Introducing Microsoft SQL Server 2019 is the must-have guide for database professionals eager to leverage the latest advancements in SQL Server 2019. This book covers the features and capabilities that make SQL Server 2019 a powerful tool for managing and analyzing data both on-premises and in the cloud.

What this Book will help me do
Understand the new features introduced in SQL Server 2019 and their practical applications.
Confidently manage and analyze relational, NoSQL, and big data within SQL Server 2019.
Implement containerization for SQL Server using Docker and Kubernetes.
Migrate and integrate your databases effectively with Power BI Report Server.
Query data from the Hadoop Distributed File System with Azure Data Studio.

Author(s)
The authors of Introducing Microsoft SQL Server 2019 are subject matter experts including Kellyn Gorman, Allan Hirt, and others. With years of professional experience in database management and SQL Server, they bring a wealth of practical insight and knowledge to the book. Their experience spans roles as administrators, architects, and educators in the field.

Who is it for?
This book is aimed at database professionals such as DBAs, architects, and big data engineers who are currently using earlier versions of SQL Server or other database platforms. It is particularly well-suited for professionals aiming to understand and implement SQL Server 2019's new features. Readers should have basic familiarity with SQL Server and RDBMS concepts. If you're looking to explore SQL Server 2019 to improve data management and analytics in your organization, this book is for you.