talk-data.com

Topic

SQL

Structured Query Language (SQL)

database_language data_manipulation data_definition programming_language

1751 tagged

Activity Trend

107 peak/qtr (2020-Q1 to 2026-Q1)

Activities

1751 activities · Newest first

podcast_episode
by Val Kroll, Julie Hoyer, Tim Wilson (Analytics Power Hour - Columbus, OH), Pawel Kapuscinski (Analytics Pros), Moe Kiss (Canva), Michael Helbling (Search Discovery)

WHERE were you the first time you listened to this podcast? Did you feel like you were JOINing a SELECT GROUP BY doing so? Can you COUNT the times you've thought to yourself, "Wow. These guys are sometimes really unFILTERed?" On this episode, Pawel Kapuscinski from Analytics Pros (and the Burnley Football Club) sits down with the group to shout at them in all caps. Or, at least, to talk about SQL: where it fits in the analyst's toolbox, how it is a powerful and necessary complement to Python and R, and who's to blame for the existence of so many different flavors of the language. Give it a listen. That's an ORDER (BY?)! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.
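For readers who want the puns grounded in an actual query, here is a minimal sketch that uses the clauses name-checked above; the listens and episodes tables and their columns are hypothetical.

    -- Hypothetical tables: listens(listener, episode_id), episodes(episode_id, topic)
    SELECT l.listener, COUNT(*) AS episodes_heard
    FROM listens AS l
    JOIN episodes AS e ON e.episode_id = l.episode_id
    WHERE e.topic = 'SQL'            -- filter the rows first
    GROUP BY l.listener              -- one result row per listener
    ORDER BY episodes_heard DESC;    -- most dedicated listeners at the top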

Summary

The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time oriented events.
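As a rough illustration of the workflow discussed in this episode, here is a minimal sketch of creating and querying a TimescaleDB hypertable; the conditions table and its columns are invented for the example, while create_hypertable and time_bucket are TimescaleDB's own functions.

    -- Hypothetical sensor table turned into a hypertable partitioned on time
    CREATE TABLE conditions (
      time        TIMESTAMPTZ NOT NULL,
      device_id   TEXT,
      temperature DOUBLE PRECISION
    );
    SELECT create_hypertable('conditions', 'time');

    -- Aggregate readings into 15-minute buckets per device
    SELECT time_bucket('15 minutes', time) AS bucket,
           device_id,
           avg(temperature) AS avg_temp
    FROM conditions
    GROUP BY bucket, device_id
    ORDER BY bucket;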

Introduction

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m welcoming Ajay Kulkarni and Mike Freedman back to talk about how TimescaleDB has grown and changed over the past year.

Interview

Introduction
How did you get involved in the area of data management?
Can you refresh our memory about what TimescaleDB is?
How has the market for timeseries databases changed since we last spoke?
What has changed in the focus and features of the TimescaleDB project and company?
Toward the end of 2018 you launched the 1.0 release of Timescale. What were your criteria for establishing that milestone?

What were the most challenging aspects of reaching that goal?

In terms of timeseries workloads, what are some of the factors that differ across varying use cases?

How do those differences impact the ways in which Timescale is used by the end user, and built by your team?

What are some of the initial assumptions that you made while first launching Timescale that have held true, and which have been disproven? How have the improvements and new features in the recent releases of PostgreSQL impacted the Timescale product?

Have you been able to leverage some of the native improvements to simplify your implementation? Are there any use cases for Timescale that would have been previously impractical in vanilla Postgres that would now be reasonable without the help of Timescale?

What is in store for the future of the Timescale product and organization?

Contact Info

Ajay

@acoustik on Twitter LinkedIn

Mike

LinkedIn Website @michaelfreedman on Twitter

Timescale

Website Documentation Careers timescaledb on GitHub @timescaledb on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

TimescaleDB Original Appearance on the Data Engineering Podcast 1.0 Release Blog Post PostgreSQL

Podcast Interview

RDS DB-Engines MongoDB IOT (Internet Of Things) AWS Timestream Kafka Pulsar

Podcast Episode

Spark

Podcast Episode

Flink

Podcast Episode

Hadoop DevOps PipelineDB

Podcast Interview

Grafana Tableau Prometheus OLTP (Online Transaction Processing) Oracle DB Data Lake

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Summary

The Hadoop platform is purpose-built for processing large, slow-moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast-moving data. To fill this need the Kudu project was created with a column-oriented table format that was tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into your analytics pipeline.
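To make the Kudu-plus-Impala pairing concrete, here is a minimal sketch of defining and querying a Kudu-backed table through Impala SQL; the metrics table, its columns, and the partition count are illustrative assumptions rather than anything from the episode.

    -- Hypothetical metrics table stored in Kudu and managed through Impala
    CREATE TABLE metrics (
      host  STRING,
      ts    BIGINT,
      value DOUBLE,
      PRIMARY KEY (host, ts)
    )
    PARTITION BY HASH (host) PARTITIONS 4
    STORED AS KUDU;

    -- Kudu tables support upserts, which suits fast-arriving, frequently updated data
    UPSERT INTO metrics VALUES ('web-01', 1546300800, 0.73);

    SELECT host, avg(value) AS avg_value
    FROM metrics
    GROUP BY host;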

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Brock Noland and Jordan Birdsell about Apache Kudu and how it is able to provide fast analytics on fast data in the Hadoop ecosystem.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by explaining what Kudu is and the motivation for building it?

How does it fit into the Hadoop ecosystem? How does it compare to the work being done on the Iceberg table format?

What are some of the common application and system design patterns that Kudu supports?
How is Kudu architected and how has it evolved over the life of the project?
There are many projects in and around the Hadoop ecosystem that rely on Zookeeper as a building block for consensus. What was the reasoning for using Raft in Kudu?
How does the storage layer in Kudu differ from what would be found in systems like Hive or HBase?

What are the implementation details in the Kudu storage interface that have had the greatest impact on its overall speed and performance?

A number of the projects built for large scale data processing were not initially built with a focus on operational simplicity. What are the features of Kudu that simplify deployment and management of production infrastructure?
What was the motivation for using C++ as the language target for Kudu?

If you were to start the project over today what would you do differently?

What are some situations where you would advise against using Kudu?
What have you found to be the most interesting/unexpected/challenging lessons learned in the process of building and maintaining Kudu?
What are you most excited about for the future of Kudu?

Contact Info

Brock

LinkedIn @brocknoland on Twitter

Jordan

LinkedIn @jordanbirdsell jbirdsell on GitHub

PhData

Website phdata on GitHub @phdatainc on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Kudu PhData Getting Started with Apache Kudu Thomson Reuters Hadoop Oracle Exadata Slowly Changing Dimensions HDFS S3 Azure Blob Storage State Farm Stanley Black & Decker ETL (Extract, Transform, Load) Parquet

Podcast Episode

ORC HBase Spark

Podcast Episode

Dynamic SQL: Applications, Performance, and Security in Microsoft SQL Server

Take a deep dive into the many uses of dynamic SQL in Microsoft SQL Server. This edition has been updated to use the newest features in SQL Server 2016 and SQL Server 2017 as well as incorporating the changing landscape of analytics and database administration. Code examples have been updated with new system objects and functions to improve efficiency and maintainability. Executing dynamic SQL is key to large-scale searching based on user-entered criteria. Dynamic SQL can generate lists of values and even code with minimal impact on performance. Dynamic SQL enables dynamic pivoting of data for business intelligence solutions as well as customizing of database objects. Yet dynamic SQL is feared by many due to concerns over SQL injection or code maintainability. Dynamic SQL: Applications, Performance, and Security in Microsoft SQL Server helps you bring the productivity and user-satisfaction of flexible and responsive applications to your organization safely and securely. Your organization’s increased ability to respond to rapidly changing business scenarios will build competitive advantage in an increasingly crowded and competitive global marketplace. With a focus on new applications and modern database architecture, this edition illustrates that dynamic SQL continues to evolve and be a valuable tool for administration, performance optimization, and analytics.
What You'll Learn
Build flexible applications that respond to changing business needs
Take advantage of creative, innovative, and productive uses of dynamic SQL
Know about SQL injection and be confident in your defenses against it
Address performance concerns in stored procedures and dynamic SQL
Troubleshoot and debug dynamic SQL to ensure correct results
Automate your administration of features within SQL Server
Who This Book Is For
Developers and database administrators looking to hone and build their T-SQL coding skills. The book is ideal for developers wanting to plumb the depths of application flexibility and troubleshoot performance issues involving dynamic SQL. The book is also ideal for programmers wanting to learn what dynamic SQL is about and how it can help them deliver competitive advantage to their organizations.
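As a small taste of the book's subject, here is a minimal T-SQL sketch of running a dynamic search safely with sp_executesql and parameters rather than string concatenation; the dbo.Products table and its columns are hypothetical.

    -- Hypothetical search on dbo.Products; parameters keep the dynamic SQL safe from injection
    DECLARE @sql NVARCHAR(MAX) =
        N'SELECT ProductID, Name, ListPrice
          FROM dbo.Products
          WHERE Category = @Category';

    EXEC sys.sp_executesql
        @sql,
        N'@Category NVARCHAR(50)',
        @Category = N'Bikes';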

Apache Spark 2: Data Processing and Real-Time Analytics

Build efficient data flow and machine learning programs with this flexible, multi-functional open-source cluster-computing framework.
Key Features
Master the art of real-time big data processing and machine learning
Explore a wide range of use-cases to analyze large data
Discover ways to optimize your work by using many features of Spark 2.x and Scala
Book Description
Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionalities such as big data processing, analytics, machine learning, and more. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to expand Spark's functionality and building your own data flow and machine learning programs on this platform. You will work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and datasets, implementing streaming analytics with Spark Streaming, and applying machine learning and deep learning techniques on Spark using MLlib and various external tools. By the end of this elaborately designed Learning Path, you will have all the knowledge you need to master Apache Spark, and build your own big data processing and analytics pipeline quickly and without any hassle. This Learning Path includes content from the following Packt products:
Mastering Apache Spark 2.x by Romeo Kienzler
Scala and Spark for Big Data Analytics by Md. Rezaul Karim, Sridhar Alla
Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei
What you will learn
Get to grips with all the features of Apache Spark 2.x
Perform highly optimized real-time big data processing
Use ML and DL techniques with Spark MLlib and third-party tools
Analyze structured and unstructured data using SparkSQL and GraphX
Understand tuning, debugging, and monitoring of big data applications
Build scalable and fault-tolerant streaming applications
Develop scalable recommendation engines
Who this book is for
If you are an intermediate-level Spark developer looking to master the advanced capabilities and use-cases of Apache Spark 2.x, this Learning Path is ideal for you. Big data professionals who want to learn how to integrate and use the features of Apache Spark and build a strong big data pipeline will also find this Learning Path useful. To grasp the concepts explained in this Learning Path, you must know the fundamentals of Apache Spark and Scala.
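For a flavor of the interactive querying mentioned above, here is a minimal Spark SQL sketch of the kind of statements you might run through spark.sql or the Spark SQL shell; the file path, view name, and columns are assumptions for the example.

    -- Expose a Parquet file as a temporary view, then query it with plain SQL
    CREATE TEMPORARY VIEW events
    USING parquet
    OPTIONS (path '/data/events.parquet');

    SELECT user_id, COUNT(*) AS clicks
    FROM events
    GROUP BY user_id
    ORDER BY clicks DESC
    LIMIT 10;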

Apache Superset Quick Start Guide

Apache Superset Quick Start Guide teaches you how to leverage Apache Superset to create interactive and insightful data visualizations. With this book, you'll understand how to integrate Superset with popular databases and build user-friendly dashboards tailored for business intelligence needs.
What this Book will help me do
Set up and configure Apache Superset for data visualization tasks.
Integrate data from SQL databases into Superset for dashboards.
Design dashboards tailored to represent business metrics and insights.
Use Superset's visualization techniques to explore and present various datasets.
Understand and apply user role management and security features in Superset.
Author(s)
Shekhar is an experienced data visualization and business intelligence specialist with years of experience in working with Apache Superset. They have written several guides on utilizing open-source tools for enterprise needs. Their technical expertise and approachable writing style make this guide practical and engaging.
Who is it for?
This book is geared towards data analysts, business intelligence professionals, and developers. Beginners to Superset can quickly grasp the fundamentals, while those with prior experience in data visualization will appreciate the advanced techniques. It's perfect for anyone looking to enhance their data storytelling and dashboard design skills.

Practical Apache Spark: Using the Scala API

Work with Apache Spark using Scala to deploy and set up single-node, multi-node, and high-availability clusters. This book discusses various components of Spark such as Spark Core, DataFrames, Datasets and SQL, Spark Streaming, Spark MLlib, and R on Spark with the help of practical code snippets for each topic. Practical Apache Spark also covers the integration of Apache Spark with Kafka with examples. You’ll follow a learn-to-do-by-yourself approach to learning – learn the concepts, practice the code snippets in Scala, and complete the assignments given to get an overall exposure. On completion, you’ll have knowledge of the functional programming aspects of Scala, and hands-on expertise in various Spark components. You’ll also become familiar with machine learning algorithms with real-time usage.
What You Will Learn
Discover the functional programming features of Scala
Understand the complete architecture of Spark and its components
Integrate Apache Spark with Hive and Kafka
Use Spark SQL, DataFrames, and Datasets to process data using traditional SQL queries
Work with different machine learning concepts and libraries using Spark's MLlib packages
Who This Book Is For
Developers and professionals who deal with batch and stream data processing.

SQL For Dummies, 9th Edition

Get ready to make SQL easy! Updated for the latest version of SQL, the new edition of this perennial bestseller shows programmers and web developers how to use SQL to build relational databases and get valuable information from them. Covering everything you need to know to make working with SQL easier than ever, topics include how to use SQL to structure a DBMS and implement a database design; secure a database; retrieve information from a database; and much more. SQL is the international standard database language used to create, access, manipulate, maintain, and store information in relational database management systems (DBMS) such as Access, Oracle, SQL Server, and MySQL. SQL adds powerful data manipulation and retrieval capabilities to conventional languages—and this book shows you how to harness the core element of relational databases with ease.
Server platform that gives you choices of development languages, data types, on-premises or cloud, and operating systems
Find great examples on the use of temporal data
Jump right in—without previous knowledge of database programming or SQL
As database-driven websites continue to grow in popularity—and complexity—SQL For Dummies is the easy-to-understand, go-to resource you need to use it seamlessly.
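In the spirit of the book's jump-right-in approach, here is a minimal sketch of the create-insert-query cycle it teaches; the table and values are invented for the example.

    -- A small standalone example: define a table, add a row, and read it back
    CREATE TABLE customers (
      customer_id INTEGER PRIMARY KEY,
      name        VARCHAR(100),
      city        VARCHAR(50)
    );

    INSERT INTO customers (customer_id, name, city)
    VALUES (1, 'Ada Lovelace', 'London');

    SELECT name, city
    FROM customers
    WHERE city = 'London';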

Summary

Apache Spark is a popular and widely used tool for a variety of data-oriented projects. With its large array of capabilities and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book for data engineers to hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by explaining what Spark is?

What are some of the main use cases for Spark?
What are some of the problems that Spark is uniquely suited to address?
Who uses Spark?

What are the tools offered to Spark users?
How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?
For someone building on top of Spark what are the main software design paradigms?

How does the design of an application change as you go from a local development environment to a production cluster?

Once your application is written, what is involved in deploying it to a production environment?
What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?
What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?
What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?
What are the limitations of the Spark programming model?

What are the cases where Spark is the wrong choice?

What was your motivation for writing a book about Spark?

Who is the target audience?

What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark? What advice do you have for anyone who is considering or currently using Spark?

Contact Info

@jgperrin on Twitter Blog

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Book Discount

Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com

Links

Apache Spark Spark In Action Book code examples in GitHub Informix International Informix Users Group MySQL Microsoft SQL Server ETL (Extract, Transform, Load) Spark SQL and Spark In Action‘s chapter 11 Spark ML and Spark In Action‘s chapter 18 Spark Streaming (structured) and Spark In Action‘s chapter 10 Spark GraphX Hadoop Jupyter

Podcast Interview

Zeppelin Databricks IBM Watson Studio Kafka Flink


Hands-On Big Data Modeling

This book, Hands-On Big Data Modeling, provides you with practical guidance on data modeling techniques, focusing particularly on the challenges of big data. You will learn the concepts behind various data models, explore tools and platforms for efficient data management, and gain hands-on experience with structured and unstructured data. What this Book will help me do Master the fundamental concepts of big data and its challenges. Explore advanced data modeling techniques using SQL, Python, and R. Design effective models for structured, semi-structured, and unstructured data types. Apply data modeling to real-world datasets like social media and sensor data. Optimize data models for performance and scalability in various big data platforms. Author(s) The authors of this book are experienced data architects and engineers with a strong background in developing scalable data solutions. They bring their collective expertise to simplify complex concepts in big data modeling, ensuring readers can effectively apply these techniques in their projects. Who is it for? This book is intended for data architects, business intelligence professionals, and any programmer interested in understanding and applying big data modeling concepts. If you are already familiar with basic data management principles and want to enhance your skills, this book is perfect for you. You will learn to tackle real-world datasets and create scalable models. Additionally, it is suitable for professionals transitioning to working with big data frameworks.
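As one hedged illustration of the modeling techniques covered, here is a minimal sketch of a star-schema design expressed in SQL; the fact and dimension tables are hypothetical and not taken from the book.

    -- Hypothetical star schema: one fact table referencing two dimensions
    CREATE TABLE dim_user (
      user_id INTEGER PRIMARY KEY,
      country VARCHAR(50)
    );

    CREATE TABLE dim_date (
      date_id   INTEGER PRIMARY KEY,
      full_date DATE
    );

    CREATE TABLE fact_posts (
      post_id BIGINT PRIMARY KEY,
      user_id INTEGER REFERENCES dim_user (user_id),
      date_id INTEGER REFERENCES dim_date (date_id),
      likes   INTEGER
    );

    -- A typical analytical query against the model
    SELECT d.country, COUNT(*) AS posts
    FROM fact_posts f
    JOIN dim_user d ON d.user_id = f.user_id
    GROUP BY d.country;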

Hands-On Data Science with SQL Server 2017

In "Hands-On Data Science with SQL Server 2017," you will discover how to implement end-to-end data analysis workflows, leveraging SQL Server's robust capabilities. This book guides you through collecting, cleaning, and transforming data, querying for insights, creating compelling visualizations, and even constructing predictive models for sophisticated analytics. What this Book will help me do Grasp the essential data science processes and how SQL Server supports them. Conduct data analysis and create interactive visualizations using Power BI. Build, train, and assess predictive models using SQL Server tools. Integrate SQL Server with R, Python, and Azure for enhanced functionality. Apply best practices for managing and transforming big data with SQL Server. Author(s) Marek Chmel and Vladimír Mužný bring their extensive experience in data science and database management to this book. Marek is a seasoned database specialist with a strong background in SQL, while Vladimír is known for his instructional expertise in analytics and data manipulation. Together, they focus on providing actionable insights and practical examples tailored for data professionals. Who is it for? This book is an ideal resource for aspiring and seasoned data scientists, data analysts, and database professionals aiming to deepen their expertise in SQL Server for data science workflows. Beginners with fundamental SQL knowledge will find it a guided entry into data science applications. It is especially suited for those who aim to implement data-driven solutions in their roles while leveraging SQL's capabilities.

Summary

Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.
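As a hedged sketch of what stateful, event-time processing looks like from the SQL layer that newer Flink releases expose, here is a windowed aggregation with a watermark; the Kafka topic, broker address, table, and columns are assumptions for illustration.

    -- Hypothetical clickstream source; the WATERMARK clause tolerates events up to 5 seconds late
    CREATE TABLE clicks (
      user_id STRING,
      url     STRING,
      ts      TIMESTAMP(3),
      WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'clicks',
      'properties.bootstrap.servers' = 'broker:9092',
      'format' = 'json'
    );

    -- One-minute tumbling windows per user, maintained as managed state by the engine
    SELECT user_id,
           TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*) AS clicks
    FROM clicks
    GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE);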

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what Flink is and how the project got started?
What are some of the primary ways that Flink is used?
How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?

What are some use cases that Flink is uniquely qualified to handle?

Where does Flink fit into the current data landscape? How is Flink architected?

How has that architecture evolved? Are there any aspects of the current design that you would do differently if you started over today?

How does scaling work in a Flink deployment?

What are the scaling limits? What are some of the failure modes that users should be aware of?

How is the statefulness of a cluster managed?

What are the mechanisms for managing conflicts?
What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose?
Can state be shared across processes or tasks within a Flink cluster?

What are the comparative challenges of working with bounded vs unbounded streams of data?
How do you handle out of order events in Flink, especially as the delay for a given event increases?
For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it?
What are some of the most challenging or complicated aspects of building and maintaining Flink?
What are some of the most interesting or unexpected ways that you have seen Flink used?
What are some of the improvements or new features that are planned for the future of Flink?
What are some features or use cases that you are explicitly not planning to support?
For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?

What do they find most interesting or exciting?

Contact Info

LinkedIn @fhueske on Twitter fhueske on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Flink Data Artisans IBM DB2 Technische Universität Berlin Hadoop Relational Database Google Cloud Dataflow Spark Cascading Java RocksDB Flink Checkpoints Flink Savepoints Kafka Pulsar Storm Scala LINQ (Language INtegrated Query) SQL Backpressure

Securing SQL Server: DBAs Defending the Database

Protect your data from attack by using SQL Server technologies to implement a defense-in-depth strategy for your database enterprise. This new edition covers threat analysis, common attacks and countermeasures, and provides an introduction to compliance that is useful for meeting regulatory requirements such as the GDPR. The multi-layered approach in this book helps ensure that a single breach does not lead to loss or compromise of confidential or business-sensitive data. Database professionals in today’s world deal increasingly with repeated data attacks against high-profile organizations and sensitive data. It is more important than ever to keep your company’s data secure. Securing SQL Server demonstrates how developers, administrators and architects can all play their part in the protection of their company’s SQL Server enterprise. This book not only provides a comprehensive guide to implementing the security model in SQL Server, including coverage of technologies such as Always Encrypted, Dynamic Data Masking, and Row Level Security, but also looks at common forms of attack against databases, such as SQL Injection and backup theft, with clear, concise examples of how to implement countermeasures against these specific scenarios. Most importantly, this book gives practical advice and engaging examples of how to defend your data, and ultimately your job, against attack and compromise.
What You'll Learn
Perform threat analysis
Implement access level control and data encryption
Avoid non-reputability by implementing comprehensive auditing
Use security metadata to ensure your security policies are enforced
Mitigate the risk of credentials being stolen
Put countermeasures in place against common forms of attack
Who This Book Is For
Database administrators who need to understand and counteract the threat of attacks against their company’s data, and useful for SQL developers and architects.
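As a small, hedged example of the defense-in-depth features mentioned, here is a T-SQL sketch applying Dynamic Data Masking and Row-Level Security; the Customers table, its columns, and the predicate function are invented for illustration.

    -- Mask email addresses for non-privileged readers (Dynamic Data Masking)
    ALTER TABLE dbo.Customers
        ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

    -- Row-Level Security: each sales rep sees only their own customers
    CREATE FUNCTION dbo.fn_customer_filter (@RepId INT)
    RETURNS TABLE
    WITH SCHEMABINDING
    AS
    RETURN SELECT 1 AS allowed
           WHERE @RepId = CAST(SESSION_CONTEXT(N'rep_id') AS INT);

    CREATE SECURITY POLICY CustomerFilter
        ADD FILTER PREDICATE dbo.fn_customer_filter(RepId) ON dbo.Customers
        WITH (STATE = ON);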

Summary

A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what Upsolver is and how it got started?

What are your goals for the platform?

There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?

What are the shortcomings of a data lake architecture?

How is Upsolver architected?

How has that architecture changed over time?
How do you manage schema validation for incoming data?
What would you do differently if you were to start over today?

What are the biggest challenges at each of the major stages of the data lake?
What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake?
When is Upsolver the wrong choice for an organization considering implementation of a data platform?
Is there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house?
What features or improvements do you have planned for the future of Upsolver?

Contact Info

Yoni

yoniiny on GitHub LinkedIn

Upsolver

Website @upsolver on Twitter LinkedIn Facebook

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Upsolver Data Lake Israeli Army Data Warehouse Data Engineering Podcast Episode About Data Curation Three Vs Kafka Spark Presto Drill Spot Instances Object Storage Cassandra Redis Latency Avro Parquet ORC Data Engineering Podcast Episode About Data Serialization Formats SSTables Run Length Encoding CSV (Comma Separated Values) Protocol Buffers Kinesis ETL DevOps Prometheus Cloudwatch DataDog InfluxDB SQL Pandas Confluent KSQL

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Summary

Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a modern data platform that can serve the data needs of an entire company.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what Looker is and the problem that it is aiming to solve?

How do you define business intelligence?

How is Looker unique from other approaches to business intelligence in the enterprise?

How does it compare to open source platforms for BI?

Can you describe the technical infrastructure that supports Looker?
Given that you are connecting to the customer’s data store, how do you ensure sufficient security?
For someone who is using Looker, what does their workflow look like?

How does that change for different user roles (e.g., data engineer vs. sales management)?

What are the scaling factors for Looker, both in terms of the volume of data being reported on and in terms of user concurrency?
What are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?

What are the portions of the Looker architecture that you would do differently if you were to start over today?

What are some of the most interesting or unusual uses of Looker that you have seen? What is in store for the future of Looker?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Looker Upworthy MoveOn.org LookML SQL Business Intelligence Data Warehouse Linux Hadoop BigQuery Snowflake Redshift DB2 PostGres ETL (Extract, Transform, Load) ELT (Extract, Load, Transform) Airflow Luigi NiFi Data Curation Episode Presto Hive Athena DRY (Don’t Repeat Yourself) Looker Action Hub Salesforce Marketo Twilio Netscape Navigator Dynamic Pricing Survival Analysis DevOps BigQuery ML Snowflake Data Sharehouse

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Learning Apache Drill

Get up to speed with Apache Drill, an extensible distributed SQL query engine that reads massive datasets in many popular file formats such as Parquet, JSON, and CSV. Drill reads data in HDFS or in cloud-native storage such as S3 and works with Hive metastores along with distributed databases such as HBase, MongoDB, and relational databases. Drill works everywhere: on your laptop or in your largest cluster. In this practical book, Drill committers Charles Givre and Paul Rogers show analysts and data scientists how to query and analyze raw data using this powerful tool. Data scientists today spend about 80% of their time just gathering and cleaning data. With this book, you’ll learn how Drill helps you analyze data more effectively to drive down time to insight. Use Drill to clean, prepare, and summarize delimited data for further analysis Query file types including logfiles, Parquet, JSON, and other complex formats Query Hadoop, relational databases, MongoDB, and Kafka with standard SQL Connect to Drill programmatically using a variety of languages Use Drill even with challenging or ambiguous file formats Perform sophisticated analysis by extending Drill’s functionality with user-defined functions Facilitate data analysis for network security, image metadata, and machine learning
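To show the query-files-in-place idea in practice, here is a minimal Drill sketch; the dfs workspace and file path are assumptions about your environment.

    -- Query a raw JSON file directly, with no schema definition or load step required
    SELECT t.user_id, COUNT(*) AS requests
    FROM dfs.`/data/logs/access.json` AS t
    GROUP BY t.user_id
    ORDER BY requests DESC
    LIMIT 20;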

Pro SQL Server on Linux: Including Container-Based Deployment with Docker and Kubernetes

Get SQL Server up and running on the Linux operating system and containers. No database professional managing or developing SQL Server on Linux will want to be without this deep and authoritative guide by one of the most respected experts on SQL Server in the industry. Get an inside look at how SQL Server for Linux works through the eyes of an engineer on the team that made it possible. Microsoft SQL Server is one of the leading database platforms in the industry, and SQL Server 2017 offers developers and administrators the ability to run a database management system on Linux, offering proven support for enterprise-level features and without onerous licensing terms. Organizations invested in Microsoft and open source technologies are now able to run a unified database platform across all their operating system investments. Organizations are further able to take full advantage of containerization through popular platforms such as Docker and Kubernetes. Pro SQL Server on Linux walks you through installing and configuring SQL Server on the Linux platform. The author is one of the principal architects of SQL Server for Linux, and brings a corresponding depth of knowledge that no database professional or developer on Linux will want to be without. Throughout this book you will find the internals of how SQL Server on Linux works, including an in-depth look at the innovative architecture. The book covers day-to-day management and troubleshooting, including diagnostics and monitoring, the use of containers to manage deployments, and the use of self-tuning and the in-memory capabilities. Also covered are performance capabilities, high availability, and disaster recovery along with security and encryption. The book covers the product-specific knowledge to bring SQL Server and its powerful features to life on the Linux platform, including coverage of containerization through Docker and Kubernetes.
What You'll Learn
Learn about the history and internals of the unique SQL Server on Linux architecture
Install and configure Microsoft’s flagship database product on the Linux platform
Manage your deployments using container technology through Docker and Kubernetes
Know the basics of building databases, the T-SQL language, and developing applications against SQL Server on Linux
Use tools and features to diagnose, manage, and monitor SQL Server on Linux
Scale your application by learning the performance capabilities of SQL Server
Deliver high availability and disaster recovery to ensure business continuity
Secure your database from attack, and protect sensitive data through encryption
Take advantage of powerful features such as Failover Clusters, Availability Groups, In-Memory Support, and SQL Server’s Self-Tuning Engine
Learn how to migrate your database from older releases of SQL Server and other database platforms such as Oracle and PostgreSQL
Build and maintain schemas, and perform management tasks from both GUI and command line
Who This Book Is For
Developers and IT professionals who are new to SQL Server and wish to configure it on the Linux operating system. This book is also useful to those familiar with SQL Server on Windows who want to learn the unique aspects of managing SQL Server on the Linux platform and Docker containers. Readers should have a grasp of relational database concepts and be comfortable with the SQL language.
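Once an instance is running on Linux, a quick sanity check from any SQL client looks like the following sketch; sys.dm_os_host_info is the dynamic management view that reports the host operating system on SQL Server 2017 and later.

    -- Confirm the build and the platform the engine is running on
    SELECT @@VERSION AS sql_server_build;

    SELECT host_platform, host_distribution, host_release
    FROM sys.dm_os_host_info;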

Summary

One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome (and it is) then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. Go to metismachine.com/webinars now to register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a NewSQL database built for simultaneous transactional and analytic workloads.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what MemSQL is and how the product and business first got started?
What are the typical use cases for customers running MemSQL?
What are the benefits of integrating the ingestion pipeline with the database engine?
What are some typical ways that the ingest capability is leveraged by customers?
How is MemSQL architected and how has the internal design evolved from when you first started working on it?
Where does it fall on the axes of the CAP theorem?
How much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory?
Can you describe the lifecycle of a write transaction?
Can you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance?
How do you mitigate the impact of network latency throughout the cluster during query planning and execution?
How much of the implementation of MemSQL is using custom built code vs. open source projects?
What are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL?
What have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL?
When is MemSQL the wrong choice for a data platform?
What do you have planned for the future of MemSQL?

Contact Info

@nikitashamgunov on Twitter LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

MemSQL NewSQL Microsoft SQL Server St. Petersburg University of Fine Mechanics And Optics C C++ In-Memory Database RAM (Random Access Memory) Flash Storage Oracle DB PostgreSQL Podcast Episode Kafka Kinesis Wealth Management Data Warehouse ODBC S3 HDFS Avro Parquet Data Serialization Podcast Episode Broadcast Join Shuffle Join CAP Theorem Apache Arrow LZ4 S2 Geospatial Library Sybase SAP Hana Kubernetes

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
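To give the ingest-integration questions above something concrete, here is a minimal sketch of a MemSQL (now SingleStore) pipeline that loads a Kafka topic straight into a table; the broker address, topic, table, and the assumption of the default text format are all hypothetical details for illustration.

    -- Hypothetical Kafka-fed pipeline; MemSQL Pipelines ingest directly into the engine
    CREATE TABLE events (
      id      BIGINT,
      payload TEXT
    );

    CREATE PIPELINE events_pipeline AS
        LOAD DATA KAFKA 'kafka-broker:9092/events-topic'
        INTO TABLE events;

    START PIPELINE events_pipeline;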

podcast_episode
by Val Kroll, Julie Hoyer, Simo Ahava (NetBooster, Helsinki - Finland), Tim Wilson (Analytics Power Hour - Columbus, OH), Moe Kiss (Canva), Michael Helbling (Search Discovery)

Are you deeply knowledgeable in JavaScript, R, the DOM, Python, AWS, jQuery, Google Cloud Platform, and SQL? Good for you! If you're not, should you be? What does "technical" mean, anyway? And, is it even possible for an analyst to dive into all of these different areas? English philosophy expert The Notorious C.M.O. (aka, Simo Ahava) returns to the show to share his thoughts on the subject in this episode. For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

Expert SQL Server Transactions and Locking: Concurrency Internals for SQL Server Practitioners

Master SQL Server’s Concurrency Model so you can implement high-throughput systems that deliver transactional consistency to your application customers. This book explains how to troubleshoot and address blocking problems and deadlocks, and write code and design database schemas to minimize concurrency issues in the systems you develop. SQL Server’s Concurrency Model is one of the least understood parts of the SQL Server Database Engine. Almost every SQL Server system experiences hard-to-explain concurrency and blocking issues, and it can be extremely confusing to solve those issues without a base of knowledge in the internals of the Engine. While confusing from the outside, the SQL Server Concurrency Model is based on several well-defined principles that are covered in this book. Understanding the internals surrounding SQL Server’s Concurrency Model helps you build high-throughput systems in multi-user environments. This book guides you through the Concurrency Model and elaborates how SQL Server supports transactional consistency in the databases. The book covers all versions of SQL Server, including Microsoft Azure SQL Database, and it includes coverage of new technologies such as In-Memory OLTP and Columnstore Indexes. What You'll Learn Know how transaction isolation levels affect locking behavior and concurrency Troubleshoot and address blocking issues and deadlocks Provide required data consistency while minimizing concurrency issues Design efficient transaction strategies that lead to scalable code Reduce concurrency problems through good schema design Understand concurrency models for In-Memory OLTP and Columnstore Indexes Reduce blocking during index maintenance, batch data load, and similar tasks Who This Book Is For SQL Server developers, database administrators, and application architects who are developing highly-concurrent applications. The book is for anyone interested in the technical aspects of creating and troubleshooting high-throughput systems that respond swiftly to user requests.
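As a compact illustration of the isolation-level behavior the book dissects, here is a hedged T-SQL sketch; the Accounts table is hypothetical, and SNAPSHOT isolation must first be enabled at the database level.

    -- One-time database setting (run in the context of the target database)
    ALTER DATABASE CURRENT SET ALLOW_SNAPSHOT_ISOLATION ON;

    -- Readers under SNAPSHOT see a consistent version of the data without blocking writers
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
    BEGIN TRANSACTION;
        SELECT SUM(Balance) FROM dbo.Accounts;   -- row-versioned read
        UPDATE dbo.Accounts SET Balance = Balance - 100 WHERE AccountId = 1;
        UPDATE dbo.Accounts SET Balance = Balance + 100 WHERE AccountId = 2;
    COMMIT;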