talk-data.com

Topic

SQL

Structured Query Language (SQL)

database_language data_manipulation data_definition programming_language

1751 tagged

Activity Trend

107 peak/qtr (2020-Q1 to 2026-Q1)

Activities

1751 activities · Newest first

Redash v5 Quick Start Guide

In the 'Redash v5 Quick Start Guide', you'll learn everything you need to master the Redash data visualization platform and confidently create compelling dashboards. This book covers how to connect to different data sources, use SQL to query data, and design and share insightful visualizations. What this Book will help me do Understand how to install, configure, and troubleshoot Redash for your data projects. Gain skills in managing user roles and permissions to ensure secure data collaboration. Learn to connect Redash to various data sources and fetch, process, and handle data. Master the creation of advanced visualizations to effectively present complex data. Develop proficiency in utilizing the Redash API for integrating programmatic interactions. Author(s) Leibzon is a recognized expert in data visualization and Business Intelligence tools, with years of experience working with data-driven systems. Drawing from his deep practical knowledge of Redash and its applications, he has crafted this guide to be accessible and highly practical. His goal is to enable learners and professionals to unlock the power of data storytelling through intuitive and actionable visualization. Who is it for? If you're a Data Analyst, BI professional, or Data Developer with basic SQL skills, this book is tailored for you. It assumes no prior knowledge of Redash but benefits those who understand fundamental Business Intelligence concepts. Whether you're looking to create your first visualization or streamline data collaboration, this guide will help you achieve your goals.
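Since the book closes with programmatic access, here is a minimal sketch of calling the Redash REST API from Python with the requests library. The instance URL and API key are placeholders, and the endpoint path and paginated response shape reflect my reading of Redash's documented API rather than anything taken from the book.

```python
# A minimal sketch of talking to the Redash REST API with requests.
import requests

REDASH_URL = "https://redash.example.com"   # hypothetical instance
API_KEY = "your-user-api-key"               # per-user key from Redash settings

session = requests.Session()
session.headers.update({"Authorization": f"Key {API_KEY}"})

# List saved queries; Redash paginates the response under "results".
resp = session.get(f"{REDASH_URL}/api/queries")
resp.raise_for_status()
for query in resp.json()["results"]:
    print(query["id"], query["name"])
```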

R Programming Fundamentals

Master the essentials of programming with R and streamline your data analysis workflow with 'R Programming Fundamentals'. This book introduces key R concepts like data structures and control flow, and guides you through practical applications such as data visualization with ggplot2. By the end, you will progress to completing a full data science project for practical hands-on experience. What this Book will help me do Learn to use R's core features, including package management, data structures, and control flow. Process and clean datasets effectively within R, handling missing values and variable transformation. Master data visualization techniques with ggplot2 to create insightful plots and charts. Develop skills to import diverse datasets such as CSVs, Excel spreadsheets, and SQL databases into R. Construct a data science project end-to-end, applying skills in analysis, visualization, and reporting. Author(s) Kaelen Medeiros is a dedicated teacher with a passion for making complex concepts accessible. Bringing years of experience in data science and statistical computing, Kaelen excels at helping learners understand and leverage R for their data analysis needs. With a focus on practical learning, Kaelen has designed this book to give you the hands-on experience and foundational knowledge you need. Who is it for? This book is perfect for analysts looking to enhance their data science toolkit by learning R. It's especially suited for those with little R programming experience looking to start with foundational concepts. Whether you're an aspiring data scientist or a seasoned professional seeking a refresher, this book offers a structured approach to mastering R effectively.

Kafka Streams in Action

Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort. About the Technology Not all stream-based applications require a dedicated processing cluster. The lightweight Kafka Streams library provides exactly the power and simplicity you need for message handling in microservices and real-time event processing. With the Kafka Streams API, you filter and transform data streams with just Kafka and your application. About the Book Kafka Streams in Action teaches you to implement stream processing within the Kafka platform. In this easy-to-follow book, you’ll explore real-world examples to collect, transform, and aggregate data, work with multiple processors, and handle real-time events. You’ll even dive into streaming SQL with KSQL! Practical to the very end, it finishes with testing and operational aspects, such as monitoring and debugging. What's Inside Using the KStreams API Filtering, transforming, and splitting data Working with the Processor API Integrating with external systems About the Reader Assumes some experience with distributed systems. No knowledge of Kafka or streaming applications required. About the Author Bill Bejeck is a Kafka Streams contributor and Confluent engineer with over 15 years of software development experience. Quotes A great way to learn about Kafka Streams and how it is a key enabler of event-driven applications. - From the Foreword by Neha Narkhede, Cocreator of Apache Kafka A comprehensive guide to Kafka Streams—from introduction to production! - Bojan Djurkovic, Cvent Bridges the gap between message brokering and real-time streaming analytics. - Jim Mantheiy Jr., Next Century Valuable both as an introduction to streams as well as an ongoing reference. - Robin Coe, TD Bank
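Kafka Streams itself is a Java library, so as a rough Python analogue of the filter-and-transform pattern the book teaches, here is a sketch using the third-party kafka-python package. The topic names, field names, and threshold are hypothetical.

```python
# Filter and transform records flowing between two Kafka topics.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "purchases",                              # hypothetical source topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for record in consumer:
    event = record.value
    if event.get("amount", 0) < 100:          # filter: drop small purchases
        continue
    event["masked_card"] = "****"             # transform: mask a sensitive field
    producer.send("large-purchases", event)   # hypothetical sink topic
```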

Summary

Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time-oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for your serverless data analysis.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Thomas Hazel about Chaos Search and their effort to bring historical depth to your Elasticsearch data.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what you have built at Chaos Search and the problems that you are trying to solve with it?

What types of data are you focused on supporting? What are the challenges inherent to scaling an Elasticsearch infrastructure to large volumes of log or metric data?

Is there any need for an Elasticsearch cluster in addition to Chaos Search? For someone who is using Chaos Search, what mechanisms/formats would they use for loading their data into S3? What are the benefits of implementing the Elasticsearch API on top of your data in S3 as opposed to using systems such as Presto or Drill to interact with the same information via SQL? Given that the S3 API has become a de facto standard for many other object storage platforms, what would be involved in running Chaos Search on data stored outside of AWS? What mechanisms do you use to allow for such drastic space savings of indexed data in S3 versus in an Elasticsearch cluster? What is the system architecture that you have built to allow for querying terabytes of data in S3?

What are the biggest contributors to query latency and what have you done to mitigate them?

What are the options for access control when running queries against the data stored in S3? What are some of the most interesting or unexpected uses of Chaos Search and access to large amounts of historical log information that you have seen? What are your plans for the future of Chaos Search?

Contact Info

Pete Cheslock

@petecheslock on Twitter Website

Thomas Hazel

@thomashazel on Twitter LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Power BI Data Analysis and Visualization

Power BI Data Analysis and Visualization provides a roadmap to vendor choices and highlights why Microsoft’s Power BI is a very viable, cost-effective option for data visualization. The book covers the fundamentals and most commonly used features of Power BI, but also includes an in-depth discussion of advanced Power BI features such as natural language queries; embedding Power BI dashboards; and live streaming data. It discusses real solutions to extract data from the ERP application, Microsoft Dynamics CRM, and also offers ways to host the Power BI Dashboard as an Azure application, extracting data from popular data sources like Microsoft SQL Server and open-source PostgreSQL. Authored by Microsoft experts, this book uses real-world coding samples and screenshots to spotlight how to create reports, embed them in a webpage, view them across multiple platforms, and more. Business owners, IT professionals, data scientists, and analysts will benefit from this thorough presentation of Power BI and its functions.
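As a hedged illustration of the live streaming feature mentioned above, the sketch below pushes rows into a Power BI streaming dataset from Python. The push URL (with its embedded key) is a placeholder for the one the Power BI portal generates when you create a streaming dataset; this is not code from the book.

```python
# Push a row into a Power BI streaming dataset over its push URL.
import datetime
import requests

# Placeholder: copy the real push URL from the Power BI portal.
PUSH_URL = "https://api.powerbi.com/beta/<workspace>/datasets/<id>/rows?key=<key>"

rows = [{"ts": datetime.datetime.utcnow().isoformat(), "temperature": 21.7}]
resp = requests.post(PUSH_URL, json=rows)
resp.raise_for_status()   # success means the dashboard tiles update in real time
```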

SQL Server 2017 Query Performance Tuning: Troubleshoot and Optimize Query Performance

Identify and fix causes of poor performance. You will learn Query Store, adaptive execution plans, and automated tuning on the Microsoft Azure SQL Database platform. This book covers the latest in performance optimization features and techniques and is current with SQL Server 2017. If your queries are not running fast enough and you’re tired of phone calls from frustrated users, then this book is the answer to your performance problems. SQL Server 2017 Query Performance Tuning is about more than quick tips and fixes. You’ll learn to be proactive in establishing performance baselines using tools such as Performance Monitor and Extended Events. You’ll recognize bottlenecks and defuse them before the phone rings. You’ll learn some quick solutions too, but the emphasis is on designing for performance and getting it right. The goal is to head off trouble before it occurs. What You'll Learn Use Query Store to understand and easily change query performance Recognize and eliminate bottlenecks leading to slow performance Deploy quick fixes when needed, following up with long-term solutions Implement best practices in T-SQL to minimize performance risk Design in the performance that you need through careful query and index design Utilize the latest performance optimization features in SQL Server 2017 Protect query performance during upgrades to the newer versions of SQL Server Who This Book Is For Developers and database administrators with responsibility for application performance in SQL Server environments. Anyone responsible for writing or creating T-SQL queries will find valuable insight into bottlenecks, including how to recognize and eliminate them.
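To make the Query Store material concrete, here is a minimal sketch that surfaces the slowest queries by reading the Query Store catalog views from Python via pyodbc. The connection string is a placeholder; the view and column names are the documented Query Store catalog objects.

```python
# Pull the ten slowest queries from SQL Server's Query Store.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=MyDb;Trusted_Connection=yes;"   # hypothetical database
)

sql = """
SELECT TOP 10
    qt.query_sql_text,
    rs.avg_duration / 1000.0 AS avg_duration_ms,  -- stored in microseconds
    rs.count_executions
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_plan AS p  ON p.plan_id = rs.plan_id
JOIN sys.query_store_query AS q ON q.query_id = p.query_id
JOIN sys.query_store_query_text AS qt ON qt.query_text_id = q.query_text_id
ORDER BY rs.avg_duration DESC;
"""
for row in conn.cursor().execute(sql):
    print(f"{row.avg_duration_ms:10.1f} ms  x{row.count_executions}  "
          f"{row.query_sql_text[:60]}")
```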

Data Science with SQL Server Quick Start Guide

"Data Science with SQL Server Quick Start Guide" introduces you to leveraging SQL Server's most recent features for data science projects. You will explore the integration of data science techniques using R, Python, and Transact-SQL within SQL Server's environment. What this Book will help me do Use SQL Server's capabilities for data science projects effectively. Understand and preprocess data using SQL queries and statistics. Design, train, and evaluate machine learning models in SQL Server. Visualize data insights through advanced graphing techniques. Deploy and utilize machine learning models within SQL Server environments. Author(s) Dejan Sarka is a data science and SQL Server expert with years of industry experience. He specializes in melding database systems with advanced analytics, offering practical guidance through real-world scenarios. His writing provides clear, step-by-step methods, making complex topics accessible. Who is it for? This book is tailored for professionals familiar with SQL Server who are looking to delve into data science. It is also ideal for data scientists aiming to incorporate SQL Server into their analytics workflows. The content assumes basic exposure to SQL Server, ensuring a straightforward learning curve for its audience.

SQL Server Advanced Data Types: JSON, XML, and Beyond

Deliver advanced functionality faster and cheaper by exploiting SQL Server's ever-growing amount of built-in support for modern data formats. Learn about the growing support within SQL Server for operations and data transformations that have previously required third-party software and all the associated licensing and development costs. Benefit through a better understanding of what can be done inside the database engine with no additional costs or development time invested in outside software. Widely used types such as JSON and XML are well-supported by the database engine. The same is true of hierarchical data and even temporal data. Knowledge of these advanced types is crucial to unleashing the full power that's available from your organization's SQL Server database investment. SQL Server Advanced Data Types explores each of the complex data types supplied within SQL Server. Common usage scenarios for each complex data type are discussed, followed by a detailed discussion on how to work with each data type. Each chapter demystifies the complex data and you learn how to use the data types most efficiently. The book offers a practical guide to working with complex data, using real-world examples to demonstrate how each data type can be leveraged. Performance considerations are also discussed, including the implementation of special indexes such as XML indexes and spatial indexes. What You'll Learn Understand the implementation of basic data types and why using the correct type is so important Work with XML data through the XML data type Construct XML data from relational result sets Store and manipulate JSON data using the JSON data type Model and analyze spatial data for geographic information systems Define hierarchies and query them efficiently through the HierarchyID type Who This Book Is For SQL Server developers and application developers who need to store and access complex data structures
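As a small taste of the JSON support described above, this sketch runs JSON_VALUE and OPENJSON through pyodbc. The document is inline so no table is required; the connection string is a placeholder.

```python
# Extract scalars with JSON_VALUE and shred an array with OPENJSON.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=MyDb;Trusted_Connection=yes;"   # hypothetical database
)

sql = """
DECLARE @doc NVARCHAR(MAX) = N'{"order": 42, "items": [
    {"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}';

SELECT JSON_VALUE(@doc, '$.order') AS order_id, i.sku, i.qty
FROM OPENJSON(@doc, '$.items')
     WITH (sku NVARCHAR(10) '$.sku', qty INT '$.qty') AS i;
"""
for row in conn.cursor().execute(sql):
    print(row.order_id, row.sku, row.qty)
```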

Beginning Apache Spark 2: With Resilient Distributed Datasets, Spark SQL, Structured Streaming and Spark Machine Learning library

Develop applications for the big data landscape with Spark and Hadoop. This book also explains the role of Spark in developing scalable machine learning and analytics applications with Cloud technologies. Beginning Apache Spark 2 gives you an introduction to Apache Spark and shows you how to work with it. Along the way, you’ll discover resilient distributed datasets (RDDs); use Spark SQL for structured data; and learn stream processing and build real-time applications with Spark Structured Streaming. Furthermore, you’ll learn the fundamentals of Spark ML for machine learning and much more. After you read this book, you will have the fundamentals to become proficient in using Apache Spark and know when and how to apply it to your big data applications. What You Will Learn Understand Spark's unified data processing platform How to run Spark in the Spark Shell or Databricks Use and manipulate RDDs Deal with structured data using Spark SQL through its operations and advanced functions Build real-time applications using Spark Structured Streaming Develop intelligent applications with the Spark Machine Learning library Who This Book Is For Programmers and developers active in big data, Hadoop, and Java but who are new to the Apache Spark platform.
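A minimal PySpark sketch of the RDD and Spark SQL basics the book introduces, runnable locally with made-up data:

```python
# Local Spark session demonstrating RDD transformations and Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro").master("local[*]").getOrCreate()

# RDDs: low-level, functional transformations.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).filter(lambda x: x > 5).collect())  # [9, 16, 25]

# Spark SQL: structured data through DataFrames and plain SQL.
df = spark.createDataFrame([("alice", 34), ("bob", 29)], schema=["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```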

Summary

One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements have changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and the ongoing efforts to add the complex features needed by the demanding workloads of today’s data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Jonathan Katz about a high-level view of PostgreSQL and the unique capabilities that it offers.

Interview

Introduction How did you get involved in the area of data management? How did you get involved in the Postgres project? For anyone who hasn’t used it, can you describe what PostgreSQL is?

Where did Postgres get started and how has it evolved over the intervening years?

What are some of the primary characteristics of Postgres that would lead someone to choose it for a given project?

What are some cases where Postgres is the wrong choice?

What are some of the common points of confusion for new users of PostgreSQL? (particularly if they have prior database experience) The recent releases of Postgres have had some fairly substantial improvements and new features. How does the community manage to balance stability and reliability against the need to add new capabilities? What are the aspects of Postgres that allow it to remain relevant in the current landscape of rapid evolution at the data layer? Are there any plans to incorporate a distributed transaction layer into the core of the project along the lines of what has been done with Citus or CockroachDB? What is in store for the future of Postgres?

Contact Info

@jkatz05 on Twitter jkatz on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

PostgreSQL Crunchy Data VenueBook Paperless Post LAMP Stack MySQL PHP SQL ORDBMS Edgar Codd A Relational Model of Data for Large Shared Data Banks Relational Algebra Oracle DB UC Berkeley Dr. Michael Stonebraker

Healthcare Analytics Made Simple

Navigate the fascinating intersection of healthcare and data science with the book "Healthcare Analytics Made Simple." This comprehensive guide empowers you to use Python and machine learning techniques to analyze and improve real healthcare systems. Demystify intricate concepts with Python code and SQL to gain actionable insights and build predictive models for healthcare. What this Book will help me do Understand healthcare incentives, policies, and datasets to ground your analysis in practical knowledge. Master the use of Python libraries and SQL for healthcare data analysis and visualization. Develop skills to apply machine learning for predictive and descriptive analytics in healthcare. Learn to assess quality metrics and evaluate provider performance using robust tools. Get acquainted with upcoming trends and future applications in healthcare analytics. Author(s) The authors, Kumar and Khader, are experts in data science and healthcare informatics. They bring years of experience teaching, researching, and applying data analytics in healthcare. Their approach is hands-on and clear, aiming to make complex topics accessible and engaging for their audience. Who is it for? This book is perfect for data science professionals eager to specialize in healthcare analytics. Additionally, clinicians aiming to leverage computing and data analytics in improving healthcare processes will find valuable insights. Programming enthusiasts and students keen to enter healthcare analytics will also greatly benefit. Tailored for beginners in this field, it is an educational yet robust resource.
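As a toy illustration of the predictive modeling workflow described above, the following sketch fits a logistic regression on synthetic data with scikit-learn. The features and outcome are invented stand-ins for the real healthcare datasets the book works with.

```python
# Fit and evaluate a simple classifier on synthetic "readmission" data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))               # stand-ins: age, length of stay, lab value
y = (X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)  # readmitted?

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```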

Professional Azure SQL Database Administration

Learn everything you need to manage Azure SQL Database with 'Professional Azure SQL Database Administration'. This book covers critical tasks such as migration, performance optimization, security, and disaster recovery. Perfect for those transitioning to the cloud, it equips you with skills to ensure your database runs smoothly and efficiently. What this Book will help me do Effectively migrate on-premise SQL Server databases to Azure. Master backup, restore, and security operations with Azure SQL Database. Optimize performance and scalability using monitoring and tuning techniques. Implement high availability and disaster recovery strategies. Simplify database management through automation and advanced techniques. Author(s) Ahmad Osama is a seasoned database admin and Azure expert with extensive experience in SQL Server and cloud database management. As a consultant and trainer, he has guided numerous organizations through cloud transitions. Ahmad's teaching philosophy blends practical insights with clear instruction. Who is it for? This book is intended for database administrators and developers looking to transition their skills to Azure SQL Database. If you have some experience with on-premise SQL Server and are familiar with PowerShell, you'll find this guide invaluable. Ideal for those wanting to develop, migrate, or manage Azure SQL solutions.
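The book manages Azure SQL Database largely through the portal and PowerShell; as a hedged Python counterpart, this sketch connects with pyodbc and reads the current service objective, the performance tier you scale when tuning for cost and throughput. Server, database, and credentials are placeholders.

```python
# Connect to Azure SQL Database and report its current service tier.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=MyDb;"  # placeholders
    "UID=admin_user;PWD=secret;Encrypt=yes;"               # Azure requires encryption
)
# ServiceObjective reports the performance tier (e.g. 'S0', 'P1').
row = conn.cursor().execute(
    "SELECT DATABASEPROPERTYEX(DB_NAME(), 'ServiceObjective')").fetchone()
print("Current service objective:", row[0])
```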

Streaming Systems

Streaming data is a big deal in big data these days. As more and more businesses seek to tame the massive unbounded data sets that pervade our world, streaming systems have finally reached a level of maturity sufficient for mainstream adoption. With this practical guide, data engineers, data scientists, and developers will learn how to work with streaming data in a conceptual and platform-agnostic way. Expanded from Tyler Akidau’s popular blog posts "Streaming 101" and "Streaming 102", this book takes you from an introductory level to a nuanced understanding of the what, where, when, and how of processing real-time data streams. You’ll also dive deep into watermarks and exactly-once processing with co-authors Slava Chernyak and Reuven Lax. You’ll explore: How streaming and batch data processing patterns compare The core principles and concepts behind robust out-of-order data processing How watermarks track progress and completeness in infinite datasets How exactly-once data processing techniques ensure correctness How the concepts of streams and tables form the foundations of both batch and streaming data processing The practical motivations behind a powerful persistent state mechanism, driven by a real-world example How time-varying relations provide a link between stream processing and the world of SQL and relational algebra
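The book's what/where/when/how framing grew out of the same Dataflow-model work as Apache Beam, so as an illustrative (not book-supplied) sketch, here is event-time windowing in Beam's Python SDK with made-up timestamps:

```python
# Event-time fixed windows: elements are grouped by when they happened,
# not when they were processed.
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    (p
     | beam.Create([("user1", 1, 0.0), ("user1", 2, 30.0), ("user1", 5, 65.0)])
     # Attach event-time timestamps (seconds since epoch) to each element.
     | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
     # "Where in event time": fixed one-minute windows.
     | beam.WindowInto(window.FixedWindows(60))
     | beam.CombinePerKey(sum)                 # per-user sum within each window
     | beam.Map(print))                        # ('user1', 3) and ('user1', 5)
```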

Learning SAS by Example

Learn to program SAS by example! Learning SAS by Example, A Programmer’s Guide, Second Edition, teaches SAS programming from very basic concepts to more advanced topics. Because most programmers prefer examples rather than reference-type syntax, this book uses short examples to explain each topic. The second edition has brought this classic book on SAS programming up to the latest SAS version, with new chapters that cover topics such as PROC SGPLOT and Perl regular expressions. This book belongs on the shelf (or e-book reader) of anyone who programs in SAS, from those with little programming experience who want to learn SAS to intermediate and even advanced SAS programmers who want to learn new techniques or identify new ways to accomplish existing tasks. In an instructive and conversational tone, author Ron Cody clearly explains each programming technique and then illustrates it with one or more real-life examples, followed by a detailed description of how the program works. The text is divided into four major sections: Getting Started, DATA Step Processing, Presenting and Summarizing Your Data, and Advanced Topics. Subjects addressed include Reading data from external sources Learning details of DATA step programming Subsetting and combining SAS data sets Understanding SAS functions and working with arrays Creating reports with PROC REPORT and PROC TABULATE Getting started with the SAS macro language Leveraging PROC SQL Generating high-quality graphics Using advanced features of user-defined formats and informats Restructuring SAS data sets Working with multiple observations per subject Getting started with Perl regular expressions You can test your knowledge and hone your skills by solving the problems at the end of each chapter.

Apache Hive Essentials - Second Edition

"Apache Hive Essentials" provides a focused guide to mastering the essential techniques of processing and analyzing big data with Apache Hive. What this Book will help me do Set up and configure a Hive environment for big data analysis. Compose effective queries using Hive's SQL-like language to extract insights. Optimize Hive performance to handle complex datasets efficiently. Implement data security and user-defined functions to extend capabilities. Integrate Hive with Hadoop tools for comprehensive data solutions. Author(s) Dayong Du, the author of "Apache Hive Essentials," has years of experience working with big data technologies and tools. With hands-on expertise in Hadoop and the entire ecosystem, he brings a practical and informed perspective to this complex field. His approach is to make these technologies accessible to developers and analysts of all levels. Who is it for? This book is perfect for data analysts, developers, or professionals familiar with SQL who are looking to start with Apache Hive for big data processing. It is suitable for those acquainted with Hadoop and its environment and want to expand their skills into efficient data querying and management. Readers should have an interest in how to leverage big data tools for real-world solutions.

Introducing the MySQL 8 Document Store

Learn the new Document Store feature of MySQL 8 and build applications around a mix of the best features from SQL and NoSQL database paradigms. Don’t allow yourself to be forced into one paradigm or the other, but combine both approaches by using the Document Store. MySQL 8 was designed from the beginning to bridge the gap between NoSQL and SQL. Oracle recognizes that many solutions need the capabilities of both. More specifically, developers need to store objects as loose collections of schema-less documents, but those same developers also need the ability to run structured queries on their data. With MySQL 8, you can do both! Introducing the MySQL 8 Document Store presents new tools and features that make creating a hybrid database solution far easier than ever before. This book covers the vitally important MySQL Document Store, the new X Protocol for developing applications, and a new client shell called the MySQL Shell. Also covered are supporting technologies and concepts such as JSON, schema-less documents, and more. The book gives insight into how features work and how to apply them to get the most out of your MySQL experience. The book covers topics such as: The headline feature in MySQL 8 MySQL's answer to NoSQL New APIs and client protocols What You'll Learn Create NoSQL-style applications by using the Document Store Mix the NoSQL and SQL approaches by using each to its best advantage in a hybrid solution Work with the new X Protocol for application connectivity in MySQL 8 Master the new X Developer Application Programming Interfaces Combine SQL and JSON in the same database and application Migrate existing applications to MySQL Document Store Who This Book Is For Developers and database professionals wanting to learn about the most profound paradigm-changing features of the MySQL 8 Document Store
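A minimal sketch of the hybrid workflow the book describes, using MySQL Connector/Python's X DevAPI (the mysqlx module) over the X Protocol's default port 33060. Credentials, schema, and collection names are placeholders, and the reuse_existing flag assumes a recent connector version.

```python
# Store schema-less documents, then query them with a structured filter.
import mysqlx

session = mysqlx.get_session(
    {"host": "localhost", "port": 33060, "user": "app", "password": "secret"})
schema = session.get_schema("shop")           # hypothetical schema

# NoSQL side: add a loose document to a collection.
products = schema.create_collection("products", reuse_existing=True)
products.add({"name": "widget", "price": 9.99, "tags": ["new"]}).execute()

# SQL-flavored side: structured filtering with bound parameters.
result = products.find("price < :p").bind("p", 20).execute()
for doc in result.fetch_all():
    print(doc["name"], doc["price"])

session.close()
```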

Summary

Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize that you haven’t been tracking a key interaction, having to write custom logic to add that event, and then waiting to collect data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Dan Robinson about Heap and their approach to collecting, storing, and analyzing large volumes of data.

Interview

Introduction How did you get involved in the area of data management? Can you start by giving a brief overview of Heap? One of your differentiating features is the fact that you capture every interaction on web and mobile platforms for your customers. How do you prevent the user experience from suffering as a result of network congestion, while ensuring the reliable delivery of that data? Can you walk through the lifecycle of a single event from source to destination and the infrastructure components that it traverses to get there? Data collected in a user’s browser can often be messy due to various browser plugins, variations in runtime capabilities, etc. How do you ensure the integrity and accuracy of that information?

What are some of the difficulties that you have faced in establishing a representation of events that allows for uniform processing and storage?

What is your approach for merging and enriching event data with the information that you retrieve from your supported integrations?

What challenges does that pose in your processing architecture?

What are some of the problems that you have had to deal with to allow for processing and storing such large volumes of data?

How has that architecture changed or evolved over the life of the company? What are some changes that you are anticipating in the near future?

Can you describe your approach for synchronizing customer data with their individual Redshift instances and the difficulties that entails? What are some of the most interesting challenges that you have faced while building the technical and business aspects of Heap? What changes have been necessary as a result of GDPR? What are your plans for the future of Heap?

Contact Info

@danlovesproofs on Twitter [email protected] @drob on GitHub heapanalytics.com / @heap on Twitter https://heapanalytics.com/blog/category/engineering

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

SQL Primer: An Accelerated Introduction to SQL Basics

Build a core level of competency in SQL so you can recognize the parts of queries and write simple SQL statements. SQL knowledge is essential for anyone involved in programming, data science, and data management. This book covers features of SQL that are standardized and common across most database vendors. You will gain a base of knowledge that will prepare you to go deeper into the specifics of any database product you might encounter. Examples in the book are worked in PostgreSQL and SQLite, but the bulk of the examples are platform agnostic and will work on any database platform supporting SQL. Early in the book you learn about table design, the importance of keys as row identifiers, and essential query operations. You then move into more advanced topics such as grouping and summarizing, creating calculated fields, joining data from multiple tables when it makes business sense to do so, and more. Throughout the book, you are exposed to a set-based approach to the language and are provided a good grounding in subtle but important topics such as the effects of null value on query results. With the explosion of data science, SQL has regained its prominence as a top skill to have for technologists and decision makers worldwide. SQL Primer will guide you from the very basics of SQL through to the mainstream features you need to have a solid, working knowledge of this important, data-oriented language. What You'll Learn Create and populate your own database tables Read SQL queries and understand what they are doing Execute queries that get correct results Bring together related rows from multiple tables Group and sort data in support of reporting applications Get a grip on nulls, normalization, and other key concepts Employ subqueries, unions, and other advanced features Who This Book Is For Anyone new to SQL who is looking for step-by-step guidance toward understanding and writing SQL queries. The book is aimed at those who encounter SQL statements often in their work, and provides a sound baseline useful across all SQL database systems. Programmers, database managers, data scientists, and business analysts all can benefit from the baseline of SQL knowledge provided in this book.
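Since SQLite is one of the book's two example platforms, here is a self-contained sketch of its core themes, keys, joins, and grouping, using Python's built-in sqlite3 module with invented tables:

```python
# Create two related tables, then join, group, and summarize them.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 12.5), (3, 2, 99.0);
""")

# Bring together related rows, then group and sort for reporting.
for name, total in conn.execute("""
        SELECT c.name, SUM(o.amount) AS total
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY total DESC"""):
    print(name, total)
```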

Under the guise of a discussion about making the leap into a new technology, this bonus mini-episode (hopefully) clears up the ongoing confusion about the Kiss Sisters. Moe sat down with her big sister, Michele, to chat about jumping into learning an entirely new skill when time is short, expectations are high, and the learning curve is steep. The specific example they chat about is Michele's dive into Google Analytics data in BigQuery using SQL, but the tips and thoughts are applicable to any new and intimidating platform.
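As a hedged sketch of the kind of query discussed in the episode, this pulls Google Analytics session counts out of BigQuery with SQL using the official google-cloud-bigquery client. It assumes credentials are configured in the environment and uses Google's public GA sample dataset.

```python
# Query the public Google Analytics sample dataset in BigQuery.
from google.cloud import bigquery

client = bigquery.Client()   # picks up credentials from the environment
sql = """
    SELECT trafficSource.source AS source, COUNT(*) AS sessions
    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
    GROUP BY source
    ORDER BY sessions DESC
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row.source, row.sessions)
```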

Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark

Utilize this practical and easy-to-follow guide to modernize traditional enterprise data warehouse and business intelligence environments with next-generation big data technologies. Next-Generation Big Data takes a holistic approach, covering the most important aspects of modern enterprise big data. The book covers not only the main technology stack but also the next-generation tools and applications used for big data warehousing, data warehouse optimization, real-time and batch data ingestion and processing, real-time data visualization, big data governance, data wrangling, big data cloud deployments, and distributed in-memory big data computing. Finally, the book has an extensive and detailed coverage of big data case studies from Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard. What You’ll Learn Install Apache Kudu, Impala, and Spark to modernize enterprise data warehouse and business intelligence environments, complete with real-world, easy-to-follow examples, and practical advice Integrate HBase, Solr, Oracle, SQL Server, MySQL, Flume, Kafka, HDFS, and Amazon S3 with Apache Kudu, Impala, and Spark Use StreamSets, Talend, Pentaho, and CDAP for real-time and batch data ingestion and processing Utilize Trifacta, Alteryx, and Datameer for data wrangling and interactive data processing Turbocharge Spark with Alluxio, a distributed in-memory storage platform Deploy big data in the cloud using Cloudera Director Perform real-time data visualization and time series analysis using Zoomdata, Apache Kudu, Impala, and Spark Understand enterprise big data topics such as big data governance, metadata management, data lineage, impact analysis, and policy enforcement, and how to use Cloudera Navigator to perform common data governance tasks Implement big data use cases such as big data warehousing, data warehouse optimization, Internet of Things, real-time data ingestion and analytics, complex event processing, and scalable predictive modeling Study real-world big data case studies from innovative companies, including Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard Who This Book Is For BI and big data warehouse professionals interested in gaining practical and real-world insight into next-generation big data processing and analytics using Apache Kudu, Impala, and Spark; and those who want to learn more about other advanced enterprise topics
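A minimal sketch of querying Impala from Python with the impyla package's DB-API interface, in the spirit of the book's Impala coverage. Host and table names are placeholders.

```python
# Run a low-latency SQL aggregation through Impala (e.g. over a Kudu table).
from impala.dbapi import connect

conn = connect(host="impala.example.com", port=21050)  # hypothetical daemon
cursor = conn.cursor()

cursor.execute("""
    SELECT sensor_id, AVG(reading) AS avg_reading
    FROM telemetry
    GROUP BY sensor_id
    ORDER BY avg_reading DESC
    LIMIT 5
""")
for sensor_id, avg_reading in cursor.fetchall():
    print(sensor_id, avg_reading)
```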