Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55

2018-11-05 · Data Engineering Podcast Listen

podcast_episode

by Daniel Mintz (Looker) , Tobias Macey

AI/ML Airflow API Athena BI BigQuery Data Engineering Data Management DWH ETL/ELT Hadoop Hive +10 more

Summary

Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a a modern data platform that can serve the data needs of an entire company

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Looker is and the problem that it is aiming to solve?

How do you define business intelligence?

How is Looker unique from other approaches to business intelligence in the enterprise?

How does it compare to open source platforms for BI?

Can you describe the technical infrastructure that supports Looker? Given that you are connecting to the customer’s data store, how do you ensure sufficient security? For someone who is using Looker, what does their workflow look like?

How does that change for different user roles (e.g. data engineer vs sales management)

What are the scaling factors for Looker, both in terms of volume of data for reporting from, and for user concurrency? What are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?

What are the portions of the Looker architecture that you would do differently if you were to start over today?

What are some of the most interesting or unusual uses of Looker that you have seen? What is in store for the future of Looker?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Looker Upworthy MoveOn.org LookML SQL Business Intelligence Data Warehouse Linux Hadoop BigQuery Snowflake Redshift DB2 PostGres ETL (Extract, Transform, Load) ELT (Extract, Load, Transform) Airflow Luigi NiFi Data Curation Episode Presto Hive Athena DRY (Don’t Repeat Yourself) Looker Action Hub Salesforce Marketo Twilio Netscape Navigator Dynamic Pricing Survival Analysis DevOps BigQuery ML Snowflake Data Sharehouse

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Understanding #BigData for #BigCities with Maksim ( @MrMaksimize @CityofSanDiego )

2018-10-04 · The Future of Data Podcast | conversation with leaders, influencers, and change makers in the World of Data & Analytics Listen

podcast_episode

by Maksim Pecherskiy (City of San Diego)

Agile/Scrum AI/ML Analytics API Big Data CI/CD Data Science

In this podcast, Maksim, CDO @ City of San Diago, discussed the nuances of running big data for big cities. He shares his perspectives on effectively building a central data office in a complex and extremely collaborative environment like a big city. He shared his thoughts on some ways to effectively prioritize which project to pursue. He shared how leadership and execution could blend to solve civic issues relating to big and small cities. A great practitioner podcast for folks seeking to build a robust data science practice across a large and collaborative ecosystem.

Timeline: 0:28 Maksim's journey. 6:45 Maksim's current role. 11:46 Collaboration process in creating a data inventory. 14:52 Working with the bureaucracy. 18:35 Dealing with unforeseen circumstances at work. 20:22 Prioritization at work. 22:58 Qualities of a good data leader. 26:15 Collaboration with other cities. 27:40 Cool data projects in other cities. 30:55 Shortcomings of other city representatives. 36:54 Use cases in AI 39:00 What would Maksim change about himself? 40:50 Future cities and data 43:55 Opportunities for private investors in the public sector. 45:53 Maksim's success mantra. 50:19 Closing remark.

Maksim's Book Recommendation: The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win by Gene Kim, Kevin Behr, George Spafford amzn.to/2MAu5Xv

Podcast Link: https://futureofdata.org/understanding-bigdata-for-bigcities-with-maksim-mrmaksimize-cityofsandiego-futureofdata-podcast/

Maksim's BIO: Maksim Pecherskiy: As the CDO for the City of San Diego, working in the Performance & Analytics Department, Maksim strives to bring the necessary components together to allow the City's residents to benefit from a more efficient, agile government that is as innovative as the community around it. He has been solving complex problems with technology for nearly a decade. He spent 2014 working as a Code For America fellow in Puerto Rico, focusing on economic development. His team delivered a product called PrimerPeso that provides business owners and residents a tool to search, and apply for, government programs for which they may be eligible.

Before moving to California, Maksim was a Solutions Architect at Promet Source in Chicago, where he built large web applications and designed complex integrations. He shaped workflow, configuration management, and continuous integration processes while leading and training international development teams. Before his work at Promet, he was a software engineer at AllPlayers, who was instrumental in the design and architecture of its APIs and the development and documentation of supporting client libraries in various languages.

Maksim graduated from DePaul University with a bachelor of science degree in information systems and from Linköping University, Sweden, with a bachelor of science degree in international business. He is also certified as a Lean Six Sigma Green Belt.

About #Podcast:

FutureOfData podcast is a conversation starter to bring leaders, influencers and lead practitioners to come on show and discuss their journey in creating the data driven future.

Wanna Join? If you or any you know wants to join in, Register your interest by mailing us @ [email protected]

Want to sponsor? Email us @ [email protected]

Keywords: FutureOfData,

DataAnalytics,

Leadership,

Futurist,

Podcast,

BigData,

Strategy

Putting Airflow Into Production With James Meickle - Episode 43

2018-08-13 · Data Engineering Podcast Listen

podcast_episode

by James Meickle , Tobias Macey

Airflow Ansible API Astronomer AWS CloudFormation AWS Glue Data Engineering Data Management Data Science ETL/ELT GitHub +7 more

Summary

The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation

Interview

Introduction How did you get involved in the area of data management? What was your initial project requirement?

What tooling did you consider in addition to Airflow? What aspects of the Airflow platform led you to choose it as your implementation target?

Can you describe your current deployment architecture?

How many engineers are involved in writing tasks for your Airflow installation?

What resources were the most helpful while learning about Airflow design patterns?

How have you architected your DAGs for deployment and extensibility?

What kinds of tests and automation have you put in place to support the ongoing stability of your deployment? What are some of the dead-ends or other pitfalls that you encountered during the course of this project? What aspects of Airflow have you found to be lacking that you would like to see improved? What did you wish someone had told you before you started work on your Airflow installation?

If you were to start over would you make the same choice? If Airflow wasn’t available what would be your second choice?

What are your next steps for improvements and fixes?

Contact Info

@eronarn on Twitter Website eronarn on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Quantopian Harvard Brain Science Initiative DevOps Days Boston Google Maps API Cron ETL (Extract, Transform, Load) Azkaban Luigi AWS Glue Airflow Pachyderm

Podcast Interview

AirBnB Python YAML Ansible REST (Representational State Transfer) SAML (Security Assertion Markup Language) RBAC (Role-Based Access Control) Maxime Beauchemin

Medium Blog

Celery Dask

Podcast Interview

PostgreSQL

Podcast Interview

Redis Cloudformation Jupyter Notebook Qubole Astronomer

Podcast Interview

Gunicorn Kubernetes Airflow Improvement Proposals Python Enhancement Proposals (PEP)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Mastering Kibana 6.x

2018-07-31 · O'Reilly Data Science Books O'Reilly Amazon

book

by Anurag Srivastava

AI/ML Analytics Big Data Data Analytics ELK Kibana Cyber Security data data-science data-science-tasks data-visualization

Mastering Kibana 6.x is your guide to leveraging Kibana for creating impactful data visualizations and insightful dashboards. From setting up basic visualizations to exploring advanced analytics and machine learning integrations, this book equips you with the necessary skills to dive deep into your data and gain actionable insights at scale. You'll also learn to effectively manage and monitor data with powerful tools such as X-Pack and Beats. What this Book will help me do Build sophisticated dashboards to visualize elastic stack data effectively. Understand and utilize Timelion expressions for analyzing time series data. Incorporate X-Pack capabilities to enhance security and monitoring in Kibana. Extract, analyze, and visualize data from Elasticsearch for advanced analytics. Set up monitoring and alerting using Beats components for reliable data operations. Author(s) With extensive experience in big data technologies, the author brings a practical approach to teaching advanced Kibana topics. Having worked on real-world data analytics projects, their aim is to make complex concepts accessible while showing how to tackle analytics challenges using Kibana. Who is it for? This book is ideal for data engineers, DevOps professionals, and data scientists who want to optimize large-scale data visualizations. If you're looking to manage Elasticsearch data through insightful dashboards and visual analytics, or enhance your data operations with features like machine learning, then this book is perfect for you. A basic understanding of the Elastic Stack is helpful, though not required.

Dev Ops for Data Science

2018-07-11 · Data Skeptic Listen

podcast_episode

by Kyle Polich , Damien Brady (Microsoft) , Donovan Brown (Microsoft) , Paige Bailey (Google)

AI/ML CI/CD Cloud Computing Data Science Git Microsoft

We revisit the 2018 Microsoft Build in this episode, focusing on the latest ideas in DevOps. Kyle interviews Cloud Developer Advocates Damien Brady, Paige Bailey, and Donovan Brown to talk about DevOps and data science and databases. For a data scientist, what does it even mean to "build"? Packaging and deployment are things that a data scientist doesn't normally have to consider in their day-to-day work. The process of making an AI app is usually divided into two streams of work: data scientists building machine learning models and app developers building the application for end users to consume. DevOps includes all the parties involved in getting the application deployed and maintained and thinking about all the phases that follow and precede their part of the end solution. So what does DevOps mean for data science? Why should you adopt DevOps best practices? In the first half, Paige and Damian share their views on what DevOps for data science would look like and how it can be introduced to provide continuous integration, delivery, and deployment of data science models. In the second half, Donovan and Damian talk about the DevOps life cycle of putting a database under version control and carrying out deployments through a release pipeline.

Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39

2018-07-08 · Data Engineering Podcast Listen

podcast_episode

by Andy LoPresto , Kevin Doran , Tobias Macey

Agile/Scrum Airflow Flink API Chef CSV Data Engineering Data Governance Data Management Dataflow DataOps Docker +13 more

Summary

Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explained how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Kevin Doran and Andy LoPresto about Apache NiFi

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what NiFi is? What is the motivation for building a GUI as the primary interface for the tool when the current trend is to represent everything as code? How did you get involved with the project?

Where does it sit in the broader landscape of data tools?

Does the data that is processed by NiFi flow through the servers that it is running on (á la Spark/Flink/Kafka), or does it orchestrate actions on other systems (á la Airflow/Oozie)?

How do you manage versioning and backup of data flows, as well as promoting them between environments?

One of the advertised features is tracking provenance for data flows that are managed by NiFi. How is that data collected and managed?

What types of reporting are available across this information?

What are some of the use cases or requirements that lend themselves well to being solved by NiFi?

When is NiFi the wrong choice?

What is involved in deploying and scaling a NiFi installation?

What are some of the system/network parameters that should be considered? What are the scaling limitations?

What have you found to be some of the most interesting, unexpected, and/or challenging aspects of building and maintaining the NiFi project and community? What do you have planned for the future of NiFi?

Contact Info

Kevin Doran

@kevdoran on Twitter Email

Andy LoPresto

@yolopey on Twitter Email

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

NiFi HortonWorks DataFlow HortonWorks Apache Software Foundation Apple CSV XML JSON Perl Python Internet Scale Asset Management Documentum DataFlow NSA (National Security Agency) 24 (TV Show) Technology Transfer Program Agile Software Development Waterfall Spark Flink Kafka Oozie Luigi Airflow FluentD ETL (Extract, Transform, and Load) ESB (Enterprise Service Bus) MiNiFi Java C++ Provenance Kubernetes Apache Atlas Data Governance Kibana K-Nearest Neighbors DevOps DSL (Domain Specific Language) NiFi Registry Artifact Repository Nexus NiFi CLI Maven Archetype IoT Docker Backpressure NiFi Wiki TLS (Transport Layer Security) Mozilla TLS Observatory NiFi Flow Design System Data Lineage GDPR (General Data Protection Regulation)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Designing Event-Driven Systems

2018-05-15 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Ben Stopford

API Kafka Data Streaming data data-engineering streaming-messaging

Many forces affect software today: larger datasets, geographical disparities, complex company structures, and the growing need to be fast and nimble in the face of change. Proven approaches such as service-oriented and event-driven architectures are joined by newer techniques such as microservices, reactive architectures, DevOps, and stream processing. Many of these patterns are successful by themselves, but as this practical ebook demonstrates, they provide a more holistic and compelling approach when applied together. Author Ben Stopford explains how service-based architectures and stream processing tools such as Apache Kafka can help you build business-critical systems. You’ll learn how to apply patterns including Event Sourcing and CQRS, and how to build multi-team systems with microservices and SOA using patterns such as "inside out databases" and "event streams as a source of truth." These approaches provide a unique foundation for how these large, autonomous service ecosystems can communicate and share data. Learn why streaming beats request-response based architectures in complex, contemporary use cases Understand why replayable logs such as Kafka provide a backbone for both service communication and shared datasets Explore how event collaboration and event sourcing patterns increase safety and recoverability with functional, event-driven approaches Build service ecosystems that blend event-driven and request-driven interfaces using a replayable log and Kafka’s Streams API Scale beyond individual teams into larger, department- and company-sized architectures, using event streams as a source of truth

Defining DataOps with Chris Bergh - Episode 26

2018-04-08 · Data Engineering Podcast Listen

podcast_episode

by Christopher Bergh (DataKitchen) , Tobias Macey

Agile/Scrum Analytics API Data Analytics Data Engineering Data Management Datadog DataOps Informatica Talend

Summary

Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be delivered quickly and reliably. That challenge can be met by adopting practices and principles from lean manufacturing and agile software development, and the cross-functional collaboration, feedback loops, and focus on automation in the DevOps movement. In this episode Christopher Bergh discusses ways that you can start adding reliability and speed to your workflow to deliver results with confidence and consistency.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Christopher Bergh about DataKitchen and the rise of DataOps

Interview

Introduction How did you get involved in the area of data management? How do you define DataOps?

How does it compare to the practices encouraged by the DevOps movement? How does it relate to or influence the role of a data engineer?

How does a DataOps oriented workflow differ from other existing approaches for building data platforms? One of the aspects of DataOps that you call out is the practice of providing multiple environments to provide a platform for testing the various aspects of the analytics workflow in a non-production context. What are some of the techniques that are available for managing data in appropriate volumes across those deployments? The practice of testing logic as code is fairly well understood and has a large set of existing tools. What have you found to be some of the most effective methods for testing data as it flows through a system? One of the practices of DevOps is to create feedback loops that can be used to ensure that business needs are being met. What are the metrics that you track in your platform to define the value that is being created and how the various steps in the workflow are proceeding toward that goal?

In order to keep feedback loops fast it is necessary for tests to run quickly. How do you balance the need for larger quantities of data to be used for verifying scalability/performance against optimizing for cost and speed in non-production environments?

How does the DataKitchen platform simplify the process of operationalizing a data analytics workflow? As the need for rapid iteration and deployment of systems to capture, store, process, and analyze data becomes more prevalent how do you foresee that feeding back into the ways that the landscape of data tools are designed and developed?

Contact Info

LinkedIn @ChrisBergh on Twitter Email

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

DataOps Manifesto DataKitchen 2017: The Year Of DataOps Air Traffic Control Chief Data Officer (CDO) Gartner W. Edwards Deming DevOps Total Quality Management (TQM) Informatica Talend Agile Development Cattle Not Pets IDE (Integrated Devel

Database Refactoring Patterns with Pramod Sadalage - Episode 22

2018-03-12 · Data Engineering Podcast Listen

podcast_episode

by Pramod Sadalage , Tobias Macey

Agile/Scrum CI/CD Data Engineering Data Management Docker DWH GitHub Java Linux MongoDB Neo4j NoSQL +1 more

Summary

As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities during the early days of the introduction of modern development practices and co-authored a book to codify a large number of patterns to aid practitioners, and in this episode he reflects on the current state of affairs and how things have changed over the past 12 years.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Pramod Sadalage about refactoring databases and integrating database design into an iterative development workflow

Interview

Introduction How did you get involved in the area of data management? You first co-authored Refactoring Databases in 2006. What was the state of software and database system development at the time and why did you find it necessary to write a book on this subject? What are the characteristics of a database that make them more difficult to manage in an iterative context? How does the practice of refactoring in the context of a database compare to that of software? How has the prevalence of data abstractions such as ORMs or ODMs impacted the practice of schema design and evolution? Is there a difference in strategy when refactoring the data layer of a system when using a non-relational storage system? How has the DevOps movement and the increased focus on automation affected the state of the art in database versioning and evolution? What have you found to be the most problematic aspects of databases when trying to evolve the functionality of a system? Looking back over the past 12 years, what has changed in the areas of database design and evolution?

How has the landscape of tooling for managing and applying database versioning changed since you first wrote Refactoring Databases? What do you see as the biggest challenges facing us over the next few years?

Contact Info

Website pramodsadalage on GitHub @pramodsadalage on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Database Refactoring

Website Book

Thoughtworks Martin Fowler Agile Software Development XP (Extreme Programming) Continuous Integration

The Book Wikipedia

Test First Development DDL (Data Definition Language) DML (Data Modification Language) DevOps Flyway Liquibase DBMaintain Hibernate SQLAlchemy ORM (Object Relational Mapper) ODM (Object Document Mapper) NoSQL Document Database MongoDB OrientDB CouchBase CassandraDB Neo4j ArangoDB Unit Testing Integration Testing OLAP (On-Line Analytical Processing) OLTP (On-Line Transaction Processing) Data Warehouse Docker QA==Quality Assurance HIPAA (Health Insurance Portability and Accountability Act) PCI DSS (Payment Card Industry Data Security Standard) Polyglot Persistence Toplink Java ORM Ruby on Rails ActiveRecord Gem

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Data Teams with Will McGinnis - Episode 19

2018-02-19 · Data Engineering Podcast Listen

podcast_episode

by Will McGinnis , Tobias Macey

AI/ML Data Engineering Data Management Data Science GitHub Linux Scikit-learn

Summary

The responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenges it is possible for the two roles to work together effectively and produce valuable business outcomes. In this episode Will McGinnis discusses the opinions that he has gained from experience on how data teams can play to their strengths to the benefit of all.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers A few announcements:

There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.

Your host is Tobias Macey and today I’m interviewing Will McGinnis about the relationship and boundaries between data engineers and data scientists

Interview

Introduction How did you get involved in the area of data management? The terms “Data Scientist” and “Data Engineer” are fluid and seem to have a different meaning for everyone who uses them. Can you share how you define those terms? What parallels do you see between the relationships of data engineers and data scientists and those of developers and systems administrators? Is there a particular size of organization or problem that serves as a tipping point for when you start to separate the two roles into the responsibilities of more than one person or team? What are the benefits of splitting the responsibilities of data engineering and data science?

What are the disadvantages?

What are some strategies to ensure successful interaction between data engineers and data scientists? How do you view these roles evolving as they become more prevalent across companies and industries?

Contact Info

Website wdm0006 on GitHub @willmcginniser on Twitter LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Blog Post: Tendencies of Data Engineers and Data Scientists Predikto Categorical Encoders DevOps SciKit-Learn

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

2018-02-11 · Data Engineering Podcast Listen

podcast_episode

by Mike Freedman (Timescale) , Ajay Kulkarni (Timescale) , Tobias Macey

Amazon RDS Azure Cloud Computing Cloudflare Data Engineering Data Management Databricks Docker ELK GCP GitHub Grafana +14 more

Summary

As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Ajay Kulkarni and Mike Freedman about Timescale DB, a scalable timeseries database built on top of PostGreSQL

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Timescale is and how the project got started? The landscape of time series databases is extensive and oftentimes difficult to navigate. How do you view your position in that market and what makes Timescale stand out from the other options? In your blog post that explains the design decisions for how Timescale is implemented you call out the fact that the inserted data is largely append only which simplifies the index management. How does Timescale handle out of order timestamps, such as from infrequently connected sensors or mobile devices? How is Timescale implemented and how has the internal architecture evolved since you first started working on it?

What impact has the 10.0 release of PostGreSQL had on the design of the project? Is timescale compatible with systems such as Amazon RDS or Google Cloud SQL?

For someone who wants to start using Timescale what is involved in deploying and maintaining it? What are the axes for scaling Timescale and what are the points where that scalability breaks down?

Are you aware of anyone who has deployed it on top of Citus for scaling horizontally across instances?

What has been the most challenging aspect of building and marketing Timescale? When is Timescale the wrong tool to use for time series data? One of the use cases that you call out on your website is for systems metrics and monitoring. How does Timescale fit into that ecosystem and can it be used along with tools such as Graphite or Prometheus? What are some of the most interesting uses of Timescale that you have seen? Which came first, Timescale the business or Timescale the database, and what is your strategy for ensuring that the open source project and the company around it both maintain their health? What features or improvements do you have planned for future releases of Timescale?

Contact Info

Ajay

LinkedIn @acoustik on Twitter Timescale Blog

Mike

Website LinkedIn @michaelfreedman on Twitter Timescale Blog

Timescale

Website @timescaledb on Twitter GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Timescale PostGreSQL Citus Timescale Design Blog Post MIT NYU Stanford SDN Princeton Machine Data Timeseries Data List of Timeseries Databases NoSQL Online Transaction Processing (OLTP) Object Relational Mapper (ORM) Grafana Tableau Kafka When Boring Is Awesome PostGreSQL RDS Google Cloud SQL Azure DB Docker Continuous Aggregates Streaming Replication PGPool II Kubernetes Docker Swarm Citus Data

Website Data Engineering Podcast Interview

Database Indexing B-Tree Index GIN Index GIST Index STE Energy Redis Graphite Prometheus pg_prometheus OpenMetrics Standard Proposal Timescale Parallel Copy Hadoop PostGIS KDB+ DevOps Internet of Things MongoDB Elastic DataBricks Apache Spark Confluent New Enterprise Associates MapD Benchmark Ventures Hortonworks 2σ Ventures CockroachDB Cloudflare EMC Timescale Blog: Why SQL is beating NoSQL, and what this means for the future of data

The intro and outro music is from a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug?utm_source=rss&utm_medium=rss" target="_blank"…

Buzzfeed Data Infrastructure with Walter Menendez - Episode 7

2017-11-14 · Data Engineering Podcast Listen

podcast_episode

by Walter Menendez (BuzzFeed) , Tobias Macey

Analytics AWS Amazon EMR CI/CD Cloud Computing Data Engineering Data Management Datadog GCP GitHub Google Analytics Linux +6 more

Summary

Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To surface the insights that they need to grow their business they need a robust data infrastructure to reliably capture all of those interactions. Walter Menendez is a data engineer on their infrastructure team and in this episode he describes how they manage data ingestion from a wide array of sources and create an interface for their data scientists to produce valuable conclusions.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Walter Menendez about the data engineering platform at Buzzfeed

Interview

Introduction How did you get involved in the area of data management? How is the data engineering team at Buzzfeed structured and what kinds of projects are you responsible for? What are some of the types of data inputs and outputs that you work with at Buzzfeed? Is the core of your system using a real-time streaming approach or is it primarily batch-oriented and what are the business needs that drive that decision? What does the architecture of your data platform look like and what are some of the most significant areas of technical debt? Which platforms and languages are most widely leveraged in your team and what are some of the outliers? What are some of the most significant challenges that you face, both technically and organizationally? What are some of the dead ends that you have run into or failed projects that you have tried? What has been the most successful project that you have completed and how do you measure that success?

Contact Info

@hackwalter on Twitter walterm on GitHub

Links

Data Literacy MIT Media Lab Tumblr Data Capital Data Infrastructure Google Analytics Datadog Python Numpy SciPy NLTK Go Language NSQ Tornado PySpark AWS EMR Redshift Tracking Pixel Google Cloud Don’t try to be google Stop Hiring DevOps Engineers and Start Growing Them

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

The Biml Book: Business Intelligence and Data Warehouse Automation

2017-10-30 · O'Reilly Business Intelligence Books O'Reilly Amazon

book

by Cathrine Wilhelmsen , Simon Peck , Reeves Smith , Benjamin Weissman , Bill Fellows , Scott Currie , Peter Avenant , Andy Leonard , Jacob Alley , Raymond Sondak , Martin Andersson

BI DWH Microsoft SQL SSAS SSIS business-intelligence data data-science

Learn Business Intelligence Markup Language (Biml) for automating much of the repetitive, manual labor involved in data integration. We teach you how to build frameworks and use advanced Biml features to get more out of SQL Server Integration Services (SSIS), Transact-SQL (T-SQL), and SQL Server Analysis Services (SSAS) than you ever thought possible. The first part of the book starts with the basics—getting your development environment configured, Biml syntax, and scripting essentials. Whether a beginner or a seasoned Biml expert, the next part of the book guides you through the process of using Biml to build a framework that captures both your design patterns and execution management. Design patterns are reusable code blocks that standardize the approach you use to perform certain types of data integration, logging, and other key data functions. Design patterns solve common problems encountered when developing data integration solutions. Because you do not have to build the code from scratch each time, design patterns improve your efficiency as a Biml developer. In addition to leveraging design patterns in your framework, you will learn how to build a robust metadata store and how to package your framework into Biml bundles for deployment within your enterprise. In the last part of the book, we teach you more advanced Biml features and capabilities, such as SSAS development, T-SQL recipes, documentation autogeneration, and Biml troubleshooting. The Biml Book: Provides practical and applicable examples Teaches you how to use Biml to reduce development time while improving quality Takes you through solutions to common data integration and BI challenges What You'll Learn Master the basics of Business Intelligence Markup Language (Biml) Study patterns for automating SSIS package generation Build a Biml Framework Import and transform database schemas Automate generation of scripts and projects Who This Book Is For BI developers wishing to quickly locate previously tested solutions, Microsoft BI specialists, those seeking more information about solution automation and code generation, and practitioners of Data Integration Lifecycle Management (DILM) in the DevOps enterprise

Using IBM Spectrum Copy Data Management with IBM FlashSystem A9000 or A9000R and SAP HANA

2017-08-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Markus Oscheka , Axel Westphal , Bert Dufrasne

Cloud Computing Data Management IBM SAP data data-engineering

Data is the currency of the new economy, and organizations are increasingly tasked with finding better ways to protect, recover, access, share, and use it. IBM Spectrum™ Copy Data Management is aimed at using existing data in a manner that is efficient, automated, scalable. It helps you manage all of those snapshot and IBM FlashCopy® images made to support DevOps, data protection, disaster recovery, and Hybrid Cloud computing environments. This IBM® Redpaper™ publication specifically addresses IBM Spectrum Copy Data Management in combination with IBM FlashSystem® A9000 or A9000R when used for Automated Disaster Recovery of SAP HANA.

Essentials of Cloud Application Development on IBM Bluemix

2017-08-07 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Hala Aziz , Ahmed Azraq , Sally Fikry , Ben Smith , Mohamed El-Khouly , Ahmed S. Hassan

Agile/Scrum API Cloud Computing Computer Science Dashboard Git IBM JavaScript JSON data data-engineering

Abstract This IBM® Redbooks® publication is based on the Presentations Guide of the course Essentials of Cloud Application Development on IBM Bluemix that was developed by the IBM Redbooks team in partnership with IBM Skills Academy Program. This course is designed to teach university students the basic skills that are required to develop, deploy, and test cloud-based applications that use the IBM Bluemix® cloud services. The primary target audience for this course is university students in undergraduate computer science and computer engineer programs with no previous experience working in cloud environments. However, anyone new to cloud computing can also benefit from this course. After completing this course, you should be able to accomplish the following tasks: Define cloud computing Describe the factors that lead to the adoption of cloud computing Describe the choices that developers have when creating cloud applications Describe infrastructure as a service, platform as a service, and software as a service Describe IBM Bluemix and its architecture Identify the runtimes and services that IBM Bluemix offers Describe IBM Bluemix infrastructure types Create an application in IBM Bluemix Describe the IBM Bluemix dashboard, catalog, and documentation features Explain how the application route is used to test an application from the browser Create services in IBM Bluemix Describe how to bind services to an application in IBM Bluemix Describe the environment variables that are used with IBM Bluemix services Explain what are IBM Bluemix organizations, domains, spaces, and users Describe how to create an IBM SDK for Node.js application that runs on IBM Bluemix Explain how to manage your IBM Bluemix account with the Cloud Foundry CLI Describe how to set up and use the IBM Bluemix plug-in for Eclipse Describe the role of Node.js for server-side scripting Describe IBM Bluemix DevOps Services and the capabilities of IBM DevOps Services Identify the Web IDE features in IBM Bluemix DevOps Describe how to connect a Git repository client to Bluemix DevOps Services project Explain the pipeline build and deploy processes that IBM Bluemix DevOps Services use Describe how IBM Bluemix DevOps Services integrate with the IBM Bluemix cloud Describe the agile planning tools in IBM Bluemix Describe the characteristics of REST APIs Explain the advantages of the JSON data format Describe an example of REST APIs using Watson Describe the main types of data services in IBM Bluemix Describe the benefits of IBM Cloudant® Explain how Cloudant databases and documents are accessed from IBM Bluemix Describe how to use REST APIs to interact with Cloudant database Describe Bluemix mobile backend as a service (MBaaS) and the MBaaS architecture Describe the Push Notifications service Describe the App ID service Describe the Kinetise service Describe how to create Bluemix Mobile applications by using MobileFirst Services Starter Boilerplate The workshop materials were created in June 2017. Therefore, all IBM Bluemix features that are described in this Presentations Guide and IBM Bluemix user interfaces that are used in the examples are current as of June 2017.

Building Custom Tasks for SQL Server Integration Services

2017-07-04 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Andy Leonard

ETL/ELT Microsoft SQL SSIS data data-engineering microsoft-sql-server relational-databases

Learn to build custom SSIS tasks using Visual Studio Community Edition and Visual Basic. Bring all the power of Microsoft .NET to bear on your data integration and ETL processes, and for no added cost over what you’ve already spent on licensing SQL Server. If you already have a license for SQL Server, then you do not need to spend more money to extend SSIS with custom tasks and components. Why are custom components necessary? Because even though the SSIS catalog of built-in tasks and components is a marvel of engineering, there do remain gaps in the functionality that is provided. These gaps are especially relevant to enterprises practicing Data Integration Lifecycle Management (DILMS) and/or DevOps. One of the gaps is a limitation of the SSIS Execute Package task. Developers using the stock version of that task are unable to select SSIS packages from other projects. Yet it’s useful to be able to select and execute tasks across projects, and the example used throughout this book will help you to create an Execute Catalog Package task that does in fact allow you to execute a task from another project. Building on the example’s pattern, you can create any task that you like, custom tailored to your specific, data integration and ETL needs. What You Will Learn Configure and execute Visual Studio in the way that best supports SSIS task development Create a class library as the basis for an SSIS task, and reference the needed SSIS assemblies Properly sign assemblies that you create in order to invoke them from your task Implement source code control via Visual Studio Team Services, or your own favorite tool set Code not only your tasks themselves, but also the associated task editors Troubleshoot and then execute your custom tasks as part of your own project Who This Book Is For Database administrators and developers who are involved in ETL projects built around SQL Server Integration Services (SSIS). Readers should have a background in programming along with a desire to optimize their ETL efforts by creating custom-tailored tasks for execution from SSIS packages.

Creating a Data-Driven Enterprise with DataOps

2017-03-15 · O'Reilly Data Science Books O'Reilly Amazon

book

by Ashish Thusoo , Joydeep Sen Sarma

Big Data Cloud Computing DataOps Hive data data-science

Many companies are busy collecting massive amounts of data, but few are taking advantage of this treasure horde to build a truly data insights-driven organization. To do so, the data team must democratize both data and the insights in a way that provides real-time access to all employees in the organization. This report explores DataOps, the process, culture, tools, and people required to scale big data pervasively across the enterprise. Just as DevOps has enabled organizations to improve coordination between developers and the operations team, DataOps closely connects everyone who handles data, including engineers, data scientists, analysts, and business users. Democratizing data with this approach requires removing barriers typical of siloed data, teams, and systems. In this report, Apache Hive creators Ashish Thusoo and Joydeep Sen Sarma examine the characteristics of a data-driven organization that supports a self-service model. Explore related topics such as data lakes, metadata, cloud architecture, and data-infrastructure-as-a-service Examine conclusions from a survey of more than 400 senior executives whose companies are in various stages of data maturity Learn how data pioneers at Facebook, Uber, LinkedIn, Twitter, and eBay created data-driven cultures and self-service data infrastructures for their organizations

Defining Data Engineering with Maxime Beauchemin - Episode 3

2017-03-05 · Data Engineering Podcast Listen

podcast_episode

by Maxime Beauchemin (Preset) , Tobias Macey

Airflow Beam Data Engineering Data Management Data Modelling Datadog Druid GitHub Hive Luigi

Summary

What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.

Transcript provided by CastSource

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin

Questions

Introduction How did you get involved in the field of data engineering? How do you define data engineering and how has that changed in recent years? Do you think that the DevOps movement over the past few years has had any impact on the discipline of data engineering? If so, what kinds of cross-over have you seen? For someone who wants to get started in the field of data engineering what are some of the necessary skills? What do you see as the biggest challenges facing data engineers currently? At what scale does it become necessary to differentiate between someone who does data engineering vs data infrastructure and what are the differences in terms of skill set and problem domain? How much analytical knowledge is necessary for a typical data engineer? What are some of the most important considerations when establishing new data sources to ensure that the resulting information is of sufficient quality? You have commented on the fact that data engineering borrows a number of elements from software engineering. Where does the concept of unit testing fit in data management and what are some of the most effective patterns for implementing that practice? How has the work done by data engineers and managers of data infrastructure bled back into mainstream software and systems engineering in terms of tools and best practices? How do you see the role of data engineers evolving in the next few years?

Keep In Touch

@mistercrunch on Twitter mistercrunch on GitHub Medium

Links

Datadog Airflow The Rise of the Data Engineer Druid.io Luigi Apache Beam Samza Hive Data Modeling

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Introducing The Show

2017-01-08 · Data Engineering Podcast Listen

podcast_episode

by Maxime Beauchemin (Preset) , Tobias Macey

Data Engineering Data Management

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, share it on social media, and tell your friends and co-workers. I’m your host, Tobias Macey, and today I’m speaking with Maxime Beauchemin about what it means to be a data engineer.

Interview

Who am I Systems administrator and software engineer, now DevOps, focus on automation Host of Podcast.init How did I get involved in data management Why am I starting a podcast about Data Engineering Interesting area with a lot of activity Not currently any shows focused on data engineering What kinds of topics do I want to cover Data stores Pipelines Tooling Automation Monitoring Testing Best practices Common challenges Defining the role/job hunting Relationship with data engineers/data analysts Get in touch and subscribe Website Newsletter Twitter Email

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

IBM Business Process Manager Operations Guide

2016-12-16 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Bryan Brown , Weiming Gu , Karri Carlson-Neumann , Dave Spriet , Chris Richardson , Shuo Zhang , Mark Filley

Cloud Computing IBM data data-engineering

This IBM® Redbooks® publication provides operations teams with architectural design patterns and guidelines for the day-to-day challenges that they face when managing their IBM IBM Business Process Manager (BPM) infrastructure. Today, IBM BPM L2 and L3 Support and SWAT teams are constantly advising customers how to deal with the following common challenges: Deployment options (on-premises, patterns, cloud, and so on) Administration DevOps Automation Performance monitoring and tuning Infrastructure management Scalability High Availability and Data Recovery Federation This publication enables customers to become self-sufficient, promote consistency and accelerate IBM BPM Support engagements. This IBM Redbooks publication is targeted toward technical professionals (technical support staff, IT Architects, and IT Specialists) who are responsible for meeting day-to-day challenges that they face when they are managing an IBM BPM infrastructure.

talk-data.com

DevOps

Activity Trend

Top Events

Top Speakers

Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55

Understanding #BigData for #BigCities with Maksim ( @MrMaksimize @CityofSanDiego )

FutureOfData podcast is a conversation starter to bring leaders, influencers and lead practitioners to come on show and discuss their journey in creating the data driven future.

Putting Airflow Into Production With James Meickle - Episode 43

Mastering Kibana 6.x

Dev Ops for Data Science

Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39

Designing Event-Driven Systems

Defining DataOps with Chris Bergh - Episode 26

Database Refactoring Patterns with Pramod Sadalage - Episode 22

Data Teams with Will McGinnis - Episode 19

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

Buzzfeed Data Infrastructure with Walter Menendez - Episode 7

The Biml Book: Business Intelligence and Data Warehouse Automation

Using IBM Spectrum Copy Data Management with IBM FlashSystem A9000 or A9000R and SAP HANA

Essentials of Cloud Application Development on IBM Bluemix

Building Custom Tasks for SQL Server Integration Services

Creating a Data-Driven Enterprise with DataOps

Defining Data Engineering with Maxime Beauchemin - Episode 3

Introducing The Show

IBM Business Process Manager Operations Guide