Topic: Data Management

Tags: data_governance, data_quality, metadata_management


Summary

Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To surface the insights they need to grow their business, they require a robust data infrastructure that reliably captures all of those interactions. Walter Menendez is a data engineer on their infrastructure team, and in this episode he describes how they manage data ingestion from a wide array of sources and create an interface for their data scientists to produce valuable conclusions.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey and today I’m interviewing Walter Menendez about the data engineering platform at Buzzfeed.

Interview

Introduction
How did you get involved in the area of data management?
How is the data engineering team at Buzzfeed structured and what kinds of projects are you responsible for?
What are some of the types of data inputs and outputs that you work with at Buzzfeed?
Is the core of your system using a real-time streaming approach or is it primarily batch-oriented, and what are the business needs that drive that decision?
What does the architecture of your data platform look like and what are some of the most significant areas of technical debt?
Which platforms and languages are most widely leveraged in your team and what are some of the outliers?
What are some of the most significant challenges that you face, both technically and organizationally?
What are some of the dead ends that you have run into or failed projects that you have tried?
What has been the most successful project that you have completed and how do you measure that success?

Contact Info

@hackwalter on Twitter, walterm on GitHub

Links

Data Literacy, MIT Media Lab, Tumblr, Data Capital, Data Infrastructure, Google Analytics, Datadog, Python, NumPy, SciPy, NLTK, Go Language, NSQ, Tornado, PySpark, AWS EMR, Redshift, Tracking Pixel, Google Cloud, Don’t Try to Be Google, Stop Hiring DevOps Engineers and Start Growing Them
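The stack linked above pairs NSQ with Python consumers for event ingestion. As a rough sketch of that pattern (not Buzzfeed's actual code; the topic, channel, and lookupd address are invented), a minimal pynsq reader looks like this:

```python
# Minimal NSQ consumer sketch using the pynsq client library.
# Topic/channel names and the lookupd address are illustrative only.
import nsq

def handle_event(message):
    # message.body holds the raw bytes published to the topic;
    # returning True acknowledges (FINs) the message.
    print(message.body)
    return True

reader = nsq.Reader(
    message_handler=handle_event,
    lookupd_http_addresses=["http://127.0.0.1:4161"],
    topic="page_views",    # hypothetical topic name
    channel="analytics",   # hypothetical channel name
    max_in_flight=9,
)
nsq.run()
```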

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA. Support Data Engineering Podcast.

GeoServer Beginner's Guide - Second Edition

GeoServer Beginner's Guide is your starting point for mastering GeoServer, a powerful open-source tool for serving geospatial data online. This book makes it easy to create, manage, and share maps and geographic information systems (GIS) even if you don't have advanced technical experience. With step-by-step guidance, you'll leverage GeoServer's full capabilities. What this Book will help me do Configure and install GeoServer to publish your geospatial data quickly and efficiently, making it available online. Create interactive and visually appealing maps by styling points, lines, and polygons using GeoServer's tools. Learn how to connect GeoServer with back-end databases like PostGIS for advanced data management and functionalities. Optimize GeoServer for performance and prepare for production-ready deployments, ensuring a seamless user experience. Use GeoServer's REST API to automate tasks and integrate with other applications for enhanced workflows. Author(s) Stefano Iacovella has extensive experience in GIS and web technologies, specializing in open-source solutions. With a passion for teaching, he has authored several books and tutorials that make technical topics accessible to developers and enthusiasts. His approachable writing style ensures that complex concepts are broken down into understandable steps. Who is it for? This book is designed for web developers and technical users who are new to GeoServer or open-source GIS tools. Ideal readers are those with basic server-side scripting knowledge and an interest in publishing dynamic, interactive maps. If you're looking to enhance your website with geospatial data, this guide will provide the step-by-step instructions you need.
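The last point in the blurb, automating GeoServer through its REST API, can be illustrated with a short sketch. This is a generic example against a local GeoServer using its well-known default admin account; the workspace name is made up:

```python
# Sketch: list and create GeoServer workspaces over the REST API.
# Assumes a local GeoServer on the default port with default credentials.
import requests

BASE = "http://localhost:8080/geoserver/rest"
AUTH = ("admin", "geoserver")  # GeoServer's out-of-the-box admin account

# List existing workspaces as JSON.
resp = requests.get(f"{BASE}/workspaces.json", auth=AUTH)
resp.raise_for_status()
print(resp.json())

# Create a new (hypothetical) workspace named "demo".
resp = requests.post(
    f"{BASE}/workspaces",
    json={"workspace": {"name": "demo"}},
    auth=AUTH,
)
print(resp.status_code)  # 201 on success
```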

Learning Neo4j 3.x - Second Edition

"Learning Neo4j 3.x" provides a comprehensive introduction to the world of graph databases, focusing on practical usage of Neo4j. This book guides you through the fundamentals, from installation and modeling to advanced features including security and optimization. You'll gain the skills to harness Neo4j for effective data management and visualization. What this Book will help me do Understand the basics of graph databases and how to use them effectively in real-world scenarios. Master the Cypher query language to query and manipulate graph data powerfully and intuitively. Learn to implement and optimize advanced graph techniques using the APOC library. Develop the ability to extend Neo4j's core functionality using available plugins and advanced extensions. Acquire skills to design and deploy scalable, secure enterprise-grade graph database solutions. Author(s) Jerome Baton and None Van Bruggen are experienced Neo4j specialists who share a passion for making complex technical concepts accessible. Jerome brings years of real-world experience in graph database applications, while None contributes expertise in data modeling and visualization. Together, they provide clear, focused insights with practical examples and hands-on guidance. Who is it for? This book is tailored for developers looking to extend their knowledge with graph databases to take on modern connected data challenges. It is suitable for those new to Neo4j, including beginners with databases, and will serve as a valuable guide for professionals aiming to deepen their expertise in data storage and query optimization using Neo4j.

Data Management and Analysis Using JMP

A holistic, step-by-step approach to analyzing health care data! Written for both beginner and intermediate JMP users working in or studying health care, Data Management and Analysis Using JMP: Health Care Case Studies bridges the gap between taking traditional statistics courses and successfully applying statistical analysis in the workplace. Authors Jane Oppenlander and Patricia Schaffer begin by illustrating techniques to prepare data for analysis, followed by presenting effective methods to summarize, visualize, and analyze data. The statistical analysis methods covered in the book are the foundational techniques commonly applied to meet regulatory, operational, budgeting, and research needs in the health care field. This example-driven book shows practitioners how to solve real-world problems by using an approach that includes problem definition, data management, selecting the appropriate analysis methods, step-by-step JMP instructions, and interpreting statistical results in context. Practical strategies for selecting appropriate statistical methods, remediating data anomalies, and interpreting statistical results in the domain context are emphasized. The cases presented in Data Management and Analysis Using JMP use multiple statistical methods. A progression of methods--from univariate to multivariate--is employed, illustrating a logical approach to problem-solving. Much of the data used in these cases is open source and drawn from a variety of health care settings. The book offers a welcome guide to working professionals as well as students studying statistics in health care-related fields.

Learning Ceph - Second Edition

Dive into 'Learning Ceph' to master Ceph, the powerful open-source storage solution known for its scalability and reliability. By following the book's clear instructions, you'll be equipped to deploy, configure, and integrate Ceph into your infrastructure for exabyte-scale data management. What this Book will help me do Understand the architectural principles of Ceph and its uses. Gain practical skills in deploying and managing a Ceph cluster. Learn to monitor and troubleshoot Ceph systems effectively. Explore integration possibilities with OpenStack and other platforms. Apply advanced techniques like erasure coding and CRUSH map optimization. Author(s) The authors are experienced software engineers and open-source contributors with deep expertise in storage systems and distributed computing. They bring practical, real-world examples and accessible explanations to complex topics like Ceph architecture and operation. Their passion for empowering professionals with robust technical skills shines through in this book. Who is it for? This book is ideal for system administrators, cloud engineers, or storage professionals looking to expand their knowledge of software-defined storage solutions. Whether you're new to Ceph or seeking advanced tips for optimization, this guide has something for every skill level. Prerequisite knowledge includes familiarity with Linux and server architecture concepts.
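One concrete way to exercise a Ceph cluster like the ones the book builds is through the RADOS Gateway, which exposes an S3-compatible API. A minimal sketch with boto3 follows; the endpoint URL and keys are placeholders for your own gateway:

```python
# Sketch: store and fetch an object through Ceph's S3-compatible RADOS Gateway.
# The endpoint URL and keys below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://ceph-rgw.example.com:7480",  # hypothetical RGW endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"stored in Ceph")

obj = s3.get_object(Bucket="demo-bucket", Key="hello.txt")
print(obj["Body"].read())  # b'stored in Ceph'
```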

Learn FileMaker Pro 16: The Comprehensive Guide to Building Custom Databases

Extend FileMaker's built-in functionality and totally customize your data management environment with specialized functions and menus to super-charge the results and create a truly unique and focused experience. This book includes everything a beginner needs to get started building databases with FileMaker and contains advanced tips and techniques that the most seasoned professionals will appreciate. Written by a long time FileMaker developer, this book contains material for developers of every skill level. FileMaker Pro 16 is a powerful database development application used by millions of people in diverse industries to simplify data management tasks, leverage their business information in new ways and automate many mundane tasks. A custom solution built with FileMaker can quickly tap into a powerful set of capabilities and technologies to offer users an intuitive and pleasing environment in which to achieve new levels of efficiency and professionalism. What You’ll learn Create SQL queries to build fast and efficient formulas Discover new features of version 16 such as JSON functions, Cards, Layout Object window, SortValues, UniqueValues, using variables in Data Sources Write calculations using built-in and creating your own custom functions Discover the importance of a good approach to interface and technical design Apply best practices for naming conventions and usage standards Explore advanced topics about designing professional, open-ended solutions and using advanced techniques Who This Book Is For Casual programmers, full time consultants and IT professionals.

Using IBM Spectrum Copy Data Management with IBM FlashSystem A9000 or A9000R and SAP HANA

Data is the currency of the new economy, and organizations are increasingly tasked with finding better ways to protect, recover, access, share, and use it. IBM Spectrum™ Copy Data Management is aimed at using existing data in a manner that is efficient, automated, and scalable. It helps you manage all of those snapshot and IBM FlashCopy® images made to support DevOps, data protection, disaster recovery, and Hybrid Cloud computing environments. This IBM® Redpaper™ publication specifically addresses IBM Spectrum Copy Data Management in combination with IBM FlashSystem® A9000 or A9000R when used for Automated Disaster Recovery of SAP HANA.

Learning Informatica PowerCenter 10.x - Second Edition

Dive into the world of Informatica PowerCenter 10.x, where enterprise data warehousing meets cutting-edge data management solutions. This comprehensive guide walks you through mastering ETL processes and optimizing performance, helping you become proficient in this powerful data integration tool. With step-by-step instructions, you'll build your knowledge from installation to advanced techniques. What this Book will help me do Understand how to install and configure Informatica PowerCenter 10.x for enterprise-level data integration projects, ensuring readiness to start transforming data effectively. Gain hands-on experience with PowerCenter's various developer tools, including Workflow Manager, Workflow Monitor, Designer, and Repository Manager, mastering their practical utilities. Learn and apply essential data warehousing concepts, such as Slowly Changing Dimensions (SCDs) and Incremental Aggregations, to create robust data-handling workflows. Leverage advanced PowerCenter features like pushdown optimization and partitioning to optimize performance for large-scale data processing jobs. Become proficient in migrating sources, targets, and workflows between environments, enabling seamless integration of data management solutions across enterprise systems. Author(s) Rahul Malewar, a seasoned expert in ETL and data integration, brings his extensive experience with Informatica PowerCenter to the table. With years spent working alongside global enterprises to streamline their data operations, Rahul's insights translate into a hands-on teaching style that simplifies even the most advanced concepts. Adept at bridging technical depth with accessible explanations, he has dedicated his career to empowering learners to unlock the full potential of their data warehousing tools. Who is it for? Perfect for developers and data professionals aiming to elevate their enterprise data management skills, this book is ideally suited for those new to or experienced with Informatica. Whether you're striving to become proficient in PowerCenter or seeking to implement advanced ETL concepts in your projects, this guide will equip you with the expertise to succeed. A foundational understanding of programming and data warehousing concepts is recommended for best results.

Summary

Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. This is your host Tobias Macey and today I’m interviewing Ry Walker, CEO of Astronomer, the platform for data engineering.

Interview

Introduction
How did you first get involved in the area of data management?
What is Astronomer and how did it get started?
Regulatory challenges of processing other people’s data
What does your data pipelining architecture look like?
What are the most challenging aspects of building a general purpose data management environment?
What are some of the most significant sources of technical debt in your platform?
Can you share some of the failures that you have encountered while architecting or building your platform and company, and how you overcame them?
There are certain areas of the overall data engineering workflow that are well defined and have numerous tools to choose from. What are some of the unsolved problems in data management?
What are some of the most interesting or unexpected uses of your platform that you are aware of?

Contact Information

Email, @rywalker on Twitter

Links

Astronomer, KISSmetrics, Segment, Marketing tools chart, Clickstream, HIPAA, FERPA, PCI, Mesos, Mesos DC/OS, Airflow, SSIS, Marathon, Prometheus, Grafana, Terraform, Kafka, Spark, ELK Stack, React, GraphQL, PostgreSQL, MongoDB, Ceph, Druid, Aries, Vault, Adapter Pattern, Docker, Kinesis, API Gateway, Kong, AWS Lambda, Flink, Redshift, NOAA, Informatica, SnapLogic, Meteor
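Astronomer's platform is built around Apache Airflow (linked above). As a rough sketch of what a pipeline definition looks like — the DAG id, schedule, and task logic are invented, and the import paths assume Airflow 2.x — a minimal two-task DAG might be:

```python
# Sketch: a two-task Airflow DAG (extract -> load), Airflow 2.x style.
# DAG id, schedule, and task bodies are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling records from a source system")

def load():
    print("writing records to the warehouse")

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # named schedule_interval before Airflow 2.4
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # run extract before load
```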

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA. Support Data Engineering Podcast.

In this podcast, Robin discussed how an analytics organization functions in a collaborative culture. He shed some light on preparing a robust framework while working in a policy-rich setup. This talk is a must for anyone building an analytics organization in a culture-rich or policy-rich environment.

Timeline: 0:29 Robin's journey. 6:02 Challenges in working as a chief data scientist. 9:50 Two breeds of data scientists. 13:38 Introducing data science into large companies. 16:57 Creating a center of excellence with data. 19:52 Challenges in working with a government agency. 22:57 Creating a self-serving system. 26:29 Defining chief data officer, chief analytics officer, chief data scientist. 28:28 Designing an architecture for a rapidly changing company culture. 31:39 Future of analytics and data leaders. 35:47 Art of doing business and science of doing business. 42:26 Perfect data science hire. 45:08 Closing remarks.

Podcast link: https://futureofdata.org/futureofdata-with-robin-thottungal-chief-data-scientist-at-epa/

Here's Robin's bio on his current EPA role:

- Leading the data analytics effort of a 15,000+ member agency by providing strategic vision, program development, evangelizing the value of data-driven decision making, bringing a lean-startup approach to the public sector, and building an advanced data analytics platform capable of real-time/batch analysis.

- Serving as chief data scientist for the agency, including directing, coordinating, and overseeing the division’s leadership of EPA’s multimedia data analytics, visualization, and predictive analysis work along with related tools, application development, and services.

- Developing and overseeing the implementation of Agency policy on integrated analysis of environmental data, including multimedia analysis and assessments of environmental quality, status, and trends.

- Developing, marketing, and implementing tactical and strategic plans for the Agency’s data management, advanced data analytics, and predictive analysis work.

- Leading cross-federal, state, tribal, and local government data partnerships as well as information partnerships with other entities.

About the #FutureOfData Podcast:

The FutureOfData podcast is a conversation starter that brings leaders, influencers, and leading practitioners together to discuss their journeys toward creating the data-driven future.

Wanna Join? If you or any you know wants to join in, Register your interest @ http://play.analyticsweek.com/guest/

Want to sponsor? Email us @ [email protected]

Keywords:

#FutureOfData #DataAnalytics #Leadership #Podcast #BigData #Strategy

Building on Multi-Model Databases

In many organizations today, businesspeople are busy requesting unified views of data stored across multiple sources within their organizations. But integrating multiple data types from multiple data stores is a complex, error-prone, and time-consuming process of cobbling everything together manually. This concise book examines how multi-model databases can help you integrate data storage and access across your organization in a seamless and elegant way. Authors Pete Aven and Diane Burley from MarkLogic explain how this latest evolution in data management naturally accepts heterogeneous data, enabling you to eventually phase out technical data silos. Through several case studies, you’ll discover how organizations use multi-model databases to reduce complexity, save money, take advantage of opportunities, lessen risk, and shorten time to value. Get unified views across disparate data models and formats within a single database Learn how multi-model databases leverage the inherent structure of the data being stored Load and use unstructured and semi-structured data (such as documents and text) as is Provide agility in data access and delivery through APIs, interfaces, and indexes Learn how to scale a multi-model database, and provide ACID capabilities and security Examine how a multi-model database would fit into your existing architecture

Implementing OpenStack SwiftHLM with IBM Spectrum Archive EE or IBM Spectrum Protect for Space Management

The Swift High Latency Media project seeks to create a high-latency storage back end that makes it easier for users to perform bulk operations of data tiering within a Swift data ring. In today's world, data is produced at significantly higher rates than a decade ago. The storage and data management solutions of the past can no longer keep up with the data demands of today. The policies and structures that decide and execute how that data is used, discarded, or retained determines how efficiently the data is used. The need for intelligent data management and storage is more critical now than ever before. Traditional management approaches hide cost-effective, high-latency media (HLM) storage, such as tape or optical disk archive back ends, underneath a traditional file system. The lack of HLM-aware file system interfaces and software makes it difficult for users to understand and control data access on HLM storage. Coupled with data-access latency, this lack of understanding results in slow responses and potential time-outs that affect the user experience. The Swift HLM project addresses this challenge. Running OpenStack Swift on top of HLM storage allows you to cheaply store and efficiently access large amounts of infrequently used object data. Data that is stored on tape storage can be easily adapted to an Object Storage data interface. This IBM® Redpaper™ publication describes the Swift High Latency Media project and provides guidance for installation and configuration.
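The pattern SwiftHLM enables — ordinary Swift object calls in front of tape-backed storage — looks like standard Swift client code. Here is a generic sketch using the python-swiftclient library; the auth details and names are placeholders, and SwiftHLM's own migrate/recall requests are not shown:

```python
# Sketch: basic object upload/download against an OpenStack Swift cluster
# using python-swiftclient. Credentials and names are placeholders.
from swiftclient.client import Connection

conn = Connection(
    authurl="http://swift.example.com:8080/auth/v1.0",  # hypothetical auth URL
    user="account:user",
    key="secret",
)

conn.put_container("archive")
conn.put_object("archive", "survey.dat", contents=b"cold data destined for tape")

# Reading the object back may be slow if the backend has migrated it to
# high-latency media; SwiftHLM layers explicit migrate/recall/status
# requests on top of this basic object interface.
headers, body = conn.get_object("archive", "survey.dat")
print(body)
```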

R: Mining Spatial, Text, Web, and Social Media Data

Create data mining algorithms About This Book Develop a strong strategy to solve predictive modeling problems using the most popular data mining algorithms Real-world case studies will take you from novice to intermediate to apply data mining techniques Deploy cutting-edge sentiment analysis techniques to real-world social media data using R Who This Book Is For This Learning Path is for R developers who are looking to make a career in data analysis or data mining. Those who come across data mining problems of different complexities from web, text, numerical, political, and social media domains will find all the information in this single learning path. What You Will Learn Discover how to manipulate data in R Get to know top classification algorithms written in R Explore solutions written in R based on R Hadoop projects Apply data management skills in handling large data sets Acquire knowledge about neural network concepts and their applications in data mining Create predictive models for classification, prediction, and recommendation Use various libraries on R CRAN for data mining Discover more about data potential, the pitfalls, and inferential gotchas Gain an insight into the concepts of supervised and unsupervised learning Delve into exploratory data analysis Understand the minute details of sentiment analysis In Detail Data mining is the first step to understanding data and making sense of heaps of data. Properly mined data forms the basis of all data analysis and computing performed on it. This learning path will take you from the very basics of data mining to advanced data mining techniques, and will end up with a specialized branch of data mining—social media mining. You will learn how to manipulate data with R using code snippets and how to mine frequent patterns, association, and correlation while working with R programs. You will discover how to write code for various prediction models, stream data, and time-series data. You will also be introduced to solutions written in R based on R Hadoop projects. Now that you are comfortable with data mining with R, you will move on to implementing your knowledge with the help of end-to-end data mining projects. You will learn how to apply different mining concepts to various statistical and data applications in a wide range of fields. At this stage, you will be able to complete complex data mining cases and handle any issues you might encounter during projects. After this, you will gain hands-on experience of generating insights from social media data. You will get detailed instructions on how to obtain, process, and analyze a variety of socially-generated data while providing a theoretical background to accurately interpret your findings. You will be shown R code and examples of data that can be used as a springboard as you get the chance to undertake your own analyses of business, social, or political data. This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products: Learning Data Mining with R by Bater Makhabel R Data Mining Blueprints by Pradeepta Mishra Social Media Mining with R by Nathan Danneman and Richard Heimann Style and approach A complete package which will take you from the basics of data mining to advanced data mining techniques, and will end up with a specialized branch of data mining—social media mining.
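The Learning Path works in R, but the frequent-pattern mining it opens with translates directly to Python. A minimal equivalent using pandas and mlxtend follows; the tiny transaction dataset is invented:

```python
# Sketch: frequent itemsets and association rules on a toy basket dataset,
# the same idea the book implements in R. The data here is made up.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions: each row is a basket.
baskets = pd.DataFrame(
    [
        {"bread": True, "butter": True, "milk": False},
        {"bread": True, "butter": True, "milk": True},
        {"bread": False, "butter": False, "milk": True},
        {"bread": True, "butter": False, "milk": True},
    ]
)

itemsets = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```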

Summary

Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and modern, and then they open sourced it! In this episode Justin Cunningham joins me to discuss the decisions they made and the lessons they learned in the process, including what worked, what didn’t, and what he would do differently if he was starting over today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey and today I’m interviewing Justin Cunningham about Yelp’s data pipeline.

Interview with Justin Cunningham

Introduction
How did you get involved in the area of data engineering?
Can you start by giving an overview of your pipeline and the type of workload that you are optimizing for?
What are some of the dead ends that you experienced while designing and implementing your pipeline?
As you were picking the components for your pipeline, how did you prioritize the build vs buy decisions and what are the pieces that you ended up building in-house?
What are some of the failure modes that you have experienced in the various parts of your pipeline and how have you engineered around them?
What are you using to automate deployment and maintenance of your various components and how do you monitor them for availability and accuracy?
While you were re-architecting your monolithic application into a service oriented architecture and defining the flows of data, how were you able to make the switch while verifying that you were not introducing unintended mutations into the data being produced?
Did you plan to open-source the work that you were doing from the start, or was that decision made after the project was completed? What were some of the challenges associated with making sure that it was properly structured to be amenable to making it public?
What advice would you give to anyone who is starting a brand new project and how would that advice differ for someone who is trying to retrofit a data management architecture onto an existing project?

Keep in touch

Yelp Engineering Blog, Email

Links

Kafka, Redshift, ETL, Business Intelligence, Change Data Capture, LinkedIn Databus, Apache Storm, Apache Flink, Confluent, Apache Avro, Game Days, Chaos Monkey, Simian Army, PaaSta, Apache Mesos, Marathon, SignalFx, Sensu, Thrift, Protocol Buffers, JSON Schema, Debezium, Kafka Connect, Apache Beam
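Yelp's pipeline centers on streaming change events through Kafka. As a generic illustration of the consumer side — the topic, servers, and group id are placeholders, and this uses the kafka-python client where Yelp's actual code adds schema registration and Avro encoding — a minimal consumer looks like this:

```python
# Sketch: consuming a stream of change events from Kafka with kafka-python.
# Topic, bootstrap server, and consumer group are placeholders.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "business.updates",                  # hypothetical topic name
    bootstrap_servers=["localhost:9092"],
    group_id="review-indexer",           # hypothetical consumer group
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    # Each message is one change event; a production pipeline validates it
    # against a registered schema before acting on it.
    print(message.topic, message.offset, event)
```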

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA. Support Data Engineering Podcast.

Data Lake for Enterprises

"Data Lake for Enterprises" is a comprehensive guide to building data lakes using the Lambda Architecture. It introduces big data technologies like Hadoop, Spark, and Flume, showing how to use them effectively to manage and leverage enterprise-scale data. You'll gain the skills to design and implement data systems that handle complex data challenges. What this Book will help me do Master the use of Lambda Architecture to create scalable and effective data management systems. Understand and implement technologies like Hadoop, Spark, Kafka, and Flume in an enterprise data lake. Integrate batch and stream processing techniques using big data tools for comprehensive data analysis. Optimize data lakes for performance and reliability with practical insights and techniques. Implement real-world use cases of data lakes and machine learning for predictive data insights. Author(s) None Mishra, None John, and Pankaj Misra are recognized experts in big data systems with a strong background in designing and deploying data solutions. With a clear and methodical teaching style, they bring years of experience to this book, providing readers with the tools and knowledge required to excel in enterprise big data initiatives. Who is it for? This book is ideal for software developers, data architects, and IT professionals looking to integrate a data lake strategy into their enterprises. It caters to readers with a foundational understanding of Java and big data concepts, aiming to advance their practical knowledge of building scalable data systems. If you're eager to delve into cutting-edge technologies and transform enterprise data management, this book is for you.

Exam Ref 70-761 Querying Data with Transact-SQL, 1st Edition

Prepare for Microsoft Exam 70-761–and help demonstrate your real-world mastery of SQL Server 2016 Transact-SQL data management, queries, and database programming. Designed for experienced IT professionals ready to advance their status, Exam Ref focuses on the critical-thinking and decision-making acumen needed for success at the MCSA level. Focus on the expertise measured by these objectives: Filter, sort, join, aggregate, and modify data Use subqueries, table expressions, grouping sets, and pivoting Query temporal and non-relational data, and output XML or JSON Create views, user-defined functions, and stored procedures Implement error handling, transactions, data types, and nulls This Microsoft Exam Ref: Organizes its coverage by exam objectives Features strategic, what-if scenarios to challenge you Assumes you have experience working with SQL Server as a database administrator, system engineer, or developer Includes downloadable sample database and code for SQL Server 2016 SP1 (or later) and Azure SQL Database Querying Data with Transact-SQL About the Exam Exam 70-761 focuses on the skills and knowledge necessary to manage and query data and to program databases with Transact-SQL in SQL Server 2016. About Microsoft Certification Passing this exam earns you credit toward a Microsoft Certified Solutions Associate (MCSA) certification that demonstrates your mastery of essential skills for building and implementing on-premises and cloud-based databases across organizations. Exam 70-762 (Developing SQL Databases) is also required for MCSA: SQL 2016 Database Development certification. See full details at: microsoft.com/learning
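The exam objectives above — joins, aggregation, grouping sets — are plain Transact-SQL. A small sketch running such a query from Python via pyodbc follows; the connection string, table, and columns are hypothetical:

```python
# Sketch: executing a Transact-SQL aggregate query against SQL Server 2016+
# via pyodbc. Server, database, and table names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=SalesDemo;UID=sa;PWD=YourPassword"
)
cursor = conn.cursor()

# GROUPING SETS is one of the 70-761 topics: one pass produces
# per-region subtotals plus a grand-total row (region is NULL there).
cursor.execute(
    """
    SELECT region, SUM(amount) AS total_sales
    FROM dbo.Orders
    GROUP BY GROUPING SETS ((region), ())
    ORDER BY total_sales DESC;
    """
)
for region, total_sales in cursor.fetchall():
    print(region, total_sales)

conn.close()
```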

Summary

If you like the features of Cassandra but wish it ran faster with fewer resources, then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it differentiates itself in the crowded database market.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey and today I’m interviewing Eyal Gutkind about ScyllaDB.

Interview

Introduction
How did you get involved in the area of data management?
What is ScyllaDB and why would someone choose to use it?
How do you ensure sufficient reliability and accuracy of the database engine?
The large draw of Scylla is that it is a drop-in replacement for Cassandra with faster performance and no requirement to manage the JVM. What are some of the technical and architectural design choices that have enabled you to do that?
Deployment and tuning
What challenges are introduced as a result of needing to maintain API compatibility with a different product?
Do you have visibility or advance knowledge of what new interfaces are being added to the Apache Cassandra project, or are you forced to play a game of keep up?
Are there any issues with compatibility of plugins for Cassandra running on Scylla?
For someone who wants to deploy and tune Scylla, what are the steps involved?
Is it possible to join a Scylla cluster to an existing Cassandra cluster for live data migration and zero downtime swap?
What prompted the decision to form a company around the database?
What are some other uses of Seastar?

Keep in touch

Eyal

LinkedIn

ScyllaDB

Website, @ScyllaDB on Twitter, GitHub, Mailing List, Slack

Links

Seastar Project, DataStax, XFS, TitanDB, OpenTSDB, KairosDB, CQL, Pedis
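Scylla's drop-in compatibility means standard Cassandra tooling works unchanged. A minimal sketch with the DataStax Python driver speaking CQL to a Scylla node follows; the contact point, keyspace, and table are placeholders:

```python
# Sketch: the DataStax Cassandra driver talking CQL to a Scylla node,
# exactly as it would talk to Cassandra. Names are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # a Scylla node's address
session = cluster.connect()

session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.execute(
    "CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)"
)
session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "Ada"))

for row in session.execute("SELECT id, name FROM demo.users"):
    print(row.id, row.name)

cluster.shutdown()
```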

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA. Support Data Engineering Podcast.

Summary

What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.

Transcript provided by CastSource

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin.

Questions

Introduction
How did you get involved in the field of data engineering?
How do you define data engineering and how has that changed in recent years?
Do you think that the DevOps movement over the past few years has had any impact on the discipline of data engineering? If so, what kinds of cross-over have you seen?
For someone who wants to get started in the field of data engineering, what are some of the necessary skills?
What do you see as the biggest challenges facing data engineers currently?
At what scale does it become necessary to differentiate between someone who does data engineering vs data infrastructure, and what are the differences in terms of skill set and problem domain?
How much analytical knowledge is necessary for a typical data engineer?
What are some of the most important considerations when establishing new data sources to ensure that the resulting information is of sufficient quality?
You have commented on the fact that data engineering borrows a number of elements from software engineering. Where does the concept of unit testing fit in data management and what are some of the most effective patterns for implementing that practice?
How has the work done by data engineers and managers of data infrastructure bled back into mainstream software and systems engineering in terms of tools and best practices?
How do you see the role of data engineers evolving in the next few years?
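The unit-testing question above maps naturally onto code: data checks written as ordinary tests. Here is a toy sketch of that pattern with pytest; the transform and its fixtures are invented for illustration:

```python
# Sketch: unit-testing a data transformation the way the episode discusses,
# using plain pytest. The transform and test data are illustrative only.
def dedupe_events(events):
    """Keep the latest record per event id, assuming input is time-ordered."""
    latest = {}
    for event in events:
        latest[event["id"]] = event
    return list(latest.values())

def test_dedupe_keeps_latest_record():
    events = [
        {"id": 1, "status": "pending"},
        {"id": 1, "status": "complete"},
        {"id": 2, "status": "pending"},
    ]
    result = dedupe_events(events)
    assert len(result) == 2
    assert {e["status"] for e in result if e["id"] == 1} == {"complete"}

def test_dedupe_handles_empty_input():
    assert dedupe_events([]) == []
```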

Keep In Touch

@mistercrunch on Twitter, mistercrunch on GitHub, Medium

Links

Datadog, Airflow, The Rise of the Data Engineer, Druid.io, Luigi, Apache Beam, Samza, Hive, Data Modeling

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA. Support Data Engineering Podcast.

Podcast episode
by Val Kroll, Julie Hoyer, Tim Wilson (Analytics Power Hour - Columbus (OH)), Jon Loyens (HomeAway.com / Bazaarvoice), Moe Kiss (Canva), Michael Helbling (Search Discovery), Brett Hurt (Bazaarvoice)

So, knowledge management and data management walked into a bar and bumped into GitHub. The result? Open data and, specifically, data.world! Coremetrics...and then Bazaarvoice founder Brett Hurt, along with HomeAway.com and Bazaarvoice veteran Jon Loyens, joined us to talk about what open data is, why it's gaining traction, and why we all should care. And, if you've been pining to have us record an episode that runs for more than an hour, this one is it! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

QGIS: Becoming a GIS Power User

Master data management, visualization, and spatial analysis techniques in QGIS and become a GIS power user About This Book Learn how to work with various types of data and create beautiful maps using this easy-to-follow guide Give a touch of professionalism to your maps, both for functionality and look and feel, with the help of this practical guide This progressive, hands-on guide builds on geo-spatial data and adds more reactive maps using geometry tools. Who This Book Is For If you are a user, developer, or consultant and want to know how to use QGIS to achieve the results you are used to from other types of GIS, then this learning path is for you. You are expected to be comfortable with core GIS concepts. This Learning Path will make you an expert with QGIS by showing you how to develop more complex, layered map applications. It will launch you to the next level of GIS users. What You Will Learn Create your first map by styling both vector and raster layers from different data sources Use parameters such as precipitation, relative humidity, and temperature to predict the vulnerability of fields and crops to mildew Re-project vector and raster data and see how to convert between different style formats Use a mix of web services to provide a collaborative data system Use raster analysis and a model automation tool to model the physical conditions for hydrological analysis Get the most out of the cartographic tools in QGIS to reveal the advanced tips and tricks of cartography In Detail The first module Learning QGIS, Third edition covers the installation and configuration of QGIS. You'll become a master in data creation and editing, and creating great maps. By the end of this module, you'll be able to extend QGIS with Python, getting in-depth with developing custom tools for the Processing Toolbox. The second module QGIS Blueprints gives you an overview of the application types and the technical aspects along with a few examples from the digital humanities. After estimating unknown values using interpolation methods and demonstrating visualization and analytical techniques, the module ends by creating an editable and data-rich map for the discovery of community information. The third module QGIS 2 Cookbook covers data input and output with special instructions for trickier formats. Later, we dive into exploring data, data management, and preprocessing steps to cut your data to just the important areas. At the end of this module, you will dive into the methods for analyzing routes and networks, and learn how to take QGIS beyond the out-of-the-box features with plug-ins, customization, and add-on tools. This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products: Learning QGIS, Third Edition by Anita Graser QGIS Blueprints by Ben Mearns QGIS 2 Cookbook by Alex Mandel, Víctor Olaya Ferrero, Anita Graser, Alexander Bruy Style and approach This Learning Path will get you up and running with QGIS. We start off with an introduction to QGIS and create maps and plugins. Then, we will guide you through Blueprints for geographic web applications, each of which will teach you a different feature by boiling down a complex workflow into steps you can follow. Finally, you'll turn your attention to becoming a QGIS power user and master data management, visualization, and spatial analysis techniques of QGIS.
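The first module's promise of extending QGIS with Python refers to PyQGIS. As a minimal sketch of that API — the file path and attribute name are made up, and this assumes it runs inside the QGIS Python console where the application is already initialized — loading and inspecting a vector layer looks like this:

```python
# Sketch: loading and inspecting a vector layer with PyQGIS, the Python API
# the Learning QGIS module introduces. The path and 'name' attribute are
# placeholders; run this inside the QGIS Python console.
from qgis.core import QgsVectorLayer, QgsProject

layer = QgsVectorLayer("/data/places.shp", "places", "ogr")  # hypothetical path
if not layer.isValid():
    raise RuntimeError("layer failed to load")

QgsProject.instance().addMapLayer(layer)
for feature in layer.getFeatures():
    print(feature.id(), feature["name"])  # assumes a 'name' attribute exists
```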