talk-data.com

Topic

Analytics

data_analysis insights metrics

4552 tagged

Activity Trend

398 peak/qtr (2020-Q1 to 2026-Q1)

Activities

4552 activities · Newest first

Chief data officers (CDOs) first appeared in enterprise organizations after the Sarbanes-Oxley Act became law in the United States in 2002 to improve corporate governance controls. CDOs started with a trickle, but have since become a flood, now populating more than two-thirds of large enterprises, according to a recent survey by NewVantage Partners.

To explore this dynamic role in detail, we invited Joe Dossantos, newly minted CDO for the data and analytics software vendor Qlik. Joe is responsible for data governance, internal data delivery, and self-service enablement. He also evangelizes data and analytics best practices to Qlik customers.

Prior to joining Qlik, Joe led TD Bank’s data strategy, and built and ran the Big Data Consulting Practice for EMC Corporation's Professional Services Organization.

Free Live Training: Register for our upcoming live training at webinars.bidatastorytelling.com and download our FREE 50-page Analytics Design Guide! In this episode, you'll learn why design is important. Just take a look at some of the coronavirus visuals out there. In this fun, short episode I discuss how bad design can affect you in ways you may not know!

[12:09] - "Design is one of those things that keeps the cash flow coming when it comes to BI projects."
[15:46] - "You cannot afford to ignore it; no matter how good your data is, if it's ugly, it doesn't matter."
For full show notes and the links mentioned, visit: https://bibrainz.com/podcast/45
Enjoyed the Show? Please leave us a review on iTunes.

Modern Big Data Architectures

Provides an up-to-date analysis of big data and multi-agent systems. The term Big Data refers to cases where data sets are too large or too complex for traditional data-processing software. With the spread of new concepts such as Edge Computing and the Internet of Things, the production, processing, and consumption of this data is becoming more and more distributed. As a result, applications increasingly require multiple agents that can work together. A multi-agent system (MAS) is a self-organized computer system that comprises multiple intelligent agents interacting to solve problems that are beyond the capacities of individual agents. Modern Big Data Architectures examines modern concepts and architectures for Big Data processing and analytics. This unique, up-to-date volume provides a joint analysis of big data and multi-agent systems, with emphasis on distributed, intelligent processing of very large data sets. Each chapter contains practical examples and detailed solutions suitable for a wide variety of applications. The author, an internationally recognized expert in Big Data and distributed Artificial Intelligence, demonstrates how base concepts such as agent, actor, and micro-service have reached a point of convergence, enabling next-generation systems to be built by incorporating the best aspects of the field. This book:

Illustrates how data sets are produced and how they can be utilized in various areas of industry and science
Explains how to apply common computational models and state-of-the-art architectures to process Big Data tasks
Discusses current and emerging Big Data applications of Artificial Intelligence

Modern Big Data Architectures: A Multi-Agent Systems Perspective is a timely and important resource for data science professionals and students involved in Big Data analytics, machine learning, and artificial intelligence.
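
As a loose illustration of the agent/actor model the blurb refers to, here is a minimal Python sketch (standard library only, not taken from the book) of an actor that owns a mailbox and processes one message at a time:

```python
# A tiny actor: a thread that owns a mailbox (queue) and handles one message
# at a time. This is the concurrency model the agent/actor discussion refers
# to, sketched with the standard library only.
import threading
import queue

class Actor:
    def __init__(self, name):
        self.name = name
        self.mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        self.mailbox.put(message)

    def stop(self):
        self.mailbox.put(None)   # poison pill: tells the loop to exit
        self._thread.join()

    def _run(self):
        while True:
            message = self.mailbox.get()
            if message is None:
                break
            print(f"{self.name} processed: {message}")

worker = Actor("agent-1")
worker.send({"task": "aggregate", "partition": 3})
worker.stop()
```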

Open Source Data Pipelines for Intelligent Applications

For decades, businesses have used information about their customers to make critical decisions on what to stock in inventory, which items to recommend to customers, and when to run promotions. But the advent of big data early in this century changed the game considerably. The key to achieving a competitive advantage today is the ability to process and store ever-increasing amounts of information that affect those decisions. In this report, solutions specialists from Red Hat provide an architectural guide to help you navigate the modern data analytics ecosystem. You’ll learn how the industry has evolved and examine current approaches to storage. That includes a deep dive into the anatomy of a portable data platform architecture, along with several aspects of running data pipelines and intelligent applications with Kubernetes.

Explore the history of open source data processing and the evolution of container scheduling
Get a concise overview of intelligent applications
Learn how to use storage with Kubernetes to produce effective intelligent applications
Understand how to structure applications on Kubernetes in your platform architecture
Delve into example pipeline architectures for deploying intelligent applications on Kubernetes
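
The report's pipeline architectures run on Kubernetes, which is beyond a short snippet; as a language-neutral sketch of the underlying idea, here is a toy Python pipeline built from composable generator stages. The stage names and record shape are invented:

```python
# A toy data pipeline as composable generator stages: extract -> transform ->
# load. In the report's architecture each stage would run as its own container
# on Kubernetes; here they are plain generators for illustration.
import json

def extract():
    # Source stage: emit raw records (stand-in for files, queues, or APIs).
    for raw in ['{"user": "a", "clicks": 3}', '{"user": "b", "clicks": 7}']:
        yield raw

def transform(records):
    # Transform stage: parse and enrich each record.
    for raw in records:
        rec = json.loads(raw)
        rec["clicks_doubled"] = rec["clicks"] * 2  # placeholder business logic
        yield rec

def load(records):
    # Sink stage: stand-in for writing to object storage or a warehouse.
    for rec in records:
        print("writing to store:", rec)

load(transform(extract()))
```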

Transforming Healthcare Analytics

Real-life examples of how to apply intelligence in the healthcare industry through innovative analytics. Healthcare analytics offers intelligence for making better healthcare decisions. By identifying patterns and correlations contained in complex health data, analytics has applications in hospital management, patient records, diagnosis, operating and treatment costs, and more, helping healthcare managers operate more efficiently and effectively. Transforming Healthcare Analytics: The Quest for Healthy Intelligence shares real-world use cases of a healthcare company that leverages people, process, and advanced analytics technology to deliver exemplary results. This book illustrates how healthcare professionals can transform the healthcare industry through analytics. Practical examples of modern techniques and technology show how unified analytics with data management can deliver insight-driven decisions. The authors, a data management and analytics specialist and a healthcare finance executive, share their unique perspectives on modernizing data and analytics platforms to alleviate the complexity of healthcare, distributing capabilities and analytics to key stakeholders, equipping healthcare organizations with intelligence to prepare for the future, and more. This book:

Explores innovative technologies to overcome data complexity in healthcare
Highlights how analytics can help with healthcare market analysis to gain competitive advantage
Provides strategies for building a strong foundation for healthcare intelligence
Examines managing data and analytics end-to-end, from diagnosis, to treatment, to provider payment
Discusses the future of technology and focus areas in the healthcare industry

Transforming Healthcare Analytics: The Quest for Healthy Intelligence is an important source of information for CFOs, CIOs, CTOs, healthcare managers, data scientists, statisticians, and financial analysts at healthcare institutions.

DAX Cookbook

"DAX Cookbook: Over 120 recipes to enhance your business with analytics, reporting, and business intelligence" is the ultimate guidebook for mastering DAX (Data Analysis Expressions) in business intelligence, Power BI, and SQL Server Analysis Services. With hands-on examples and extensive recipes, it enables professionals to solve real-world data challenges effectively. What this Book will help me do Understand how to create tailored calculations for dates, time, and duration to enhance data insights. Develop key performance indicators (KPIs) and advanced business metrics for strategic decision-making. Master text and numerical data transformations to construct dynamic dashboards and reports. Optimize data models and DAX queries for improved performance and analytics accuracy. Learn to handle and debug calculations, and implement complex statistical and mathematical measures. Author(s) Greg Deckler is a seasoned business intelligence professional with extensive experience in using DAX and Power BI to provide actionable insights. As a recognized expert in the field, Greg brings practical knowledge of developing scalable BI solutions. His teaching approach is rooted in clarity and real-world application, making complex topics accessible to learners of all levels. Who is it for? This book is perfect for business professionals, BI developers, and data analysts with basic knowledge of the DAX language and associated tools. If you are looking to enhance your DAX skills and solve tough analytical challenges, this book is tailored for you. It's highly relevant for those aiming to optimize business intelligence workflows and improve data-driven decisions.

Summary CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and an HTTP interface, it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some technical debt, which is being addressed with a refactored architecture based on FoundationDB. In this episode Adam Kocoloski shares the history of the project, how it works under the hood, and how the new design will improve the project for our new era of computation. This was an interesting conversation about the challenges of maintaining a large and mission-critical project and the work being done to evolve it.
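
As a taste of that HTTP interface, here is a minimal Python sketch against a local CouchDB using the requests library; the host, credentials, database name, and document are assumptions for the example:

```python
# A minimal sketch of CouchDB's HTTP interface using the `requests` library.
# Host, port, credentials, and the document contents are assumptions here;
# adjust them for your own installation.
import requests

BASE = "http://admin:password@localhost:5984"  # assumed local CouchDB with admin auth

# Create a database (CouchDB databases are created with a simple PUT).
requests.put(f"{BASE}/inventory")

# Insert a JSON document under an explicit ID.
doc = {"item": "widget", "qty": 42}
resp = requests.put(f"{BASE}/inventory/widget-001", json=doc)
print(resp.json())  # includes the revision ("rev") used for optimistic concurrency

# Read it back over plain HTTP.
print(requests.get(f"{BASE}/inventory/widget-001").json())
```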

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Adam Kocoloski about CouchDB and the work being done to migrate the storage layer to FoundationDB.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what CouchDB is?

How did you get involved in the CouchDB project and what is your current role in the community?

What are the use cases that it is well suited for?
Can you share some of the history of CouchDB and its role in the NoSQL movement?
How is CouchDB currently architected and how has it evolved since it was first introduced?
What have been the benefits and challenges of Erlang as the runtime for CouchDB?
How is the current storage engine implemented and what are its shortcomings?
What problems are you trying to solve by replatforming on a new storage layer?

What were the selection criteria for the new storage engine and how did you structure the decision making process?
What was the motivation for choosing FoundationDB as opposed to other options such as RocksDB, LevelDB, etc.?

How is the adoption of FoundationDB going to impact the overall architecture and implementation of CouchDB?
How will the use of FoundationDB impact the way that the current capabilities are implemented, such as data replication?
What will the migration path be for people running an existing installation?
What are some of the biggest challenges that you are facing in rearchitecting the codebase?
What new capabilities will the FoundationDB storage layer enable?
What are some of the most interesting/unexpected/innovative ways that you have seen CouchDB used?

What new capabilities or use cases do you anticipate once this migration is complete?

What are some of the most interesting/unexpected/challenging lessons that you have learned while working with the CouchDB project and community?
What is in store for the future of CouchDB?

Contact Info

LinkedIn
@kocolosk on Twitter
kocolosk on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Apache CouchDB
FoundationDB

Podcast Episode

IBM Cloudant
Experimental Particle Physics
FPGA == Field Programmable Gate Array
Apache Software Foundation
CRDT == Conflict-free Replicated Data Type

Podcast Episode

Erlang
Riak
RabbitMQ
Heisenbug
Kubernetes
Property Based Testing

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Implementing and Managing a High-performance Enterprise Infrastructure with Nutanix on IBM Power Systems

This IBM® Redbooks® publication describes how to implement and manage a hyperconverged private cloud solution by using theoretical knowledge, hands-on exercises, and documenting the findings by way of sample scenarios. This book also is a guide to implementing and managing a high-performance enterprise infrastructure and private cloud platform for big data, artificial intelligence, and transactional and analytics workloads on IBM Power Systems. This book uses available documentation, hardware, and software resources to meet the following goals:

Document the web-scale architecture that demonstrates the simple and agile nature of public clouds.
Showcase the hyperconverged infrastructure to help cloud native applications mine cognitive analytics workloads.
Conduct and document implementation case studies.
Document guidelines to help provide an optimal system configuration, implementation, and management.

This publication addresses topics for developers, IT architects, IT specialists, sellers, and anyone who wants to implement and manage a high-performance enterprise infrastructure and private cloud platform on IBM Power Systems. It also provides documentation to transfer the how-to skills to the technical teams, and solution guidance to the sales team. This book complements any documentation that is available in IBM Knowledge Center, and aligns with the educational materials that are provided by IBM Systems Software Education (SSE).

Intelligence at the Edge

Explore powerful SAS analytics and the Internet of Things! The world that we live in is more connected than ever before. The Internet of Things (IoT) consists of mechanical and electronic devices connected to one another and to software through the internet. Businesses can use the IoT to quickly make intelligent decisions based on massive amounts of data gathered in real time from these connected devices. IoT increases productivity, lowers operating costs, and provides insights into how businesses can serve existing markets and expand into new ones. Intelligence at the Edge: Using SAS with the Internet of Things is for anyone who wants to learn more about the rapidly changing field of IoT. Current practitioners explain how to apply SAS software and analytics to derive business value from the Internet of Things. The cornerstone of this endeavor is SAS Event Stream Processing, which enables you to process and analyze continuously flowing events in real time. With step-by-step guidance and real-world scenarios, you will learn how to apply analytics to streaming data. Each chapter explores a different aspect of IoT, including the analytics life cycle, monitoring, deployment, geofencing, machine learning, artificial intelligence, condition-based maintenance, computer vision, and edge devices.
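
SAS Event Stream Processing is the book's actual toolchain; purely to illustrate the idea of analyzing continuously flowing events, here is a minimal Python sketch of a sliding-window average over a simulated sensor stream. The event shape and alert threshold are invented:

```python
# A minimal sketch of streaming analytics: a sliding-window average over
# simulated sensor events. The event shape and alert threshold are invented;
# SAS Event Stream Processing would express this as a continuous query instead.
import random
from collections import deque

WINDOW = 10        # number of recent readings to average
THRESHOLD = 72.0   # alert when the rolling mean exceeds this (invented value)

def sensor_stream(n=100):
    # Simulate temperature readings arriving one at a time.
    for i in range(n):
        yield {"event_id": i, "temperature": random.gauss(70, 5)}

window = deque(maxlen=WINDOW)
for event in sensor_stream():
    window.append(event["temperature"])
    rolling_mean = sum(window) / len(window)
    if len(window) == WINDOW and rolling_mean > THRESHOLD:
        print(f"alert at event {event['event_id']}: rolling mean {rolling_mean:.1f}")
```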

podcast_episode
by Mico Yuk (Data Storytelling Academy), Allen Hillery (Nightingale)

You know me — I love community! Being a part of the BI community has changed my life, and it can change yours too for the better if you choose the right community and understand how to use it to your advantage. Listen and learn.

Today's guest is Allen Hillery, editor of Nightingale, the Data Visualization Society's journal. Allen describes why community is important and what you can do to give and take within the community. Recently, he interviewed me and wrote a very popular article on Medium titled, "Mico Yuk on the Importance of Community and the Paradigm Shift in Business Intelligence."

In this episode, you'll learn:
[09:25] Allen's Background: Writer, editor, and adjunct professor passionate about storytelling with data.
[10:40] Data Business Communities: First there were not enough; now there are too many to choose from.
[11:03] Priorities Put in Place: The passing of family members led to self-discovery and fulfillment through a data storytelling journey.
For full show notes and the links mentioned, visit: bibrainz.com/podcast/44
Sponsor: The next BI Data Storytelling Mastery Accelerator 3-Day Live workshop is live! Many BI teams are still struggling to deliver consistent, highly engaging analytics their users love. At the end of three days, you'll leave with a clear BI delivery action plan. Register today!
Enjoyed the Show? Please leave us a review on iTunes.

podcast_episode
by Val Kroll, Julie Hoyer, Tim Wilson (Analytics Power Hour - Columbus (OH)), Josh (Data Driven Strength), Moe Kiss (Canva), Michael Helbling (Search Discovery)

Have you heard the one about the four analysts who run a podcast who walked into a resort in Hungary? Well, now you can! Or, at least get a taste of that experience. Michael, Moe, Tim, and Josh headed to Superweek last month and, among other things, did a 12-hour audio livestream to try to give interested listeners a taste of the experience. On this episode, we're bringing you just over an hour (occasionally, we "power" right past the "hour" mark) of that livestream, centered around (but not limited to!) Michael's presentation on "the last mile of analytics," which is about the importance of self-awareness, communication, and interpersonal skills when it comes to putting analytics into action. For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

Summary Misaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The availability of cloud platforms and managed services makes this a viable option, but can lead to downstream challenges. In this episode Sean Knapp and Charlie Crocker share their experiences of working in and with companies that have dealt with shadow IT projects and the importance of enabling and empowering the use and exploration of data and analytics. If you have ever been frustrated by seemingly draconian policies or struggled to align everyone on your supported platform, then this episode will help you gain some perspective and set you on a path to productive collaboration.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Sean Knapp and Charlie Crocker about shadow IT in data and analytics.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of shadow IT?
What are some of the reasons that members of an organization might start building their own solutions outside of what is supported by the engineering teams?

What are some of the roles in an organization that you have seen involved in these shadow IT projects?

What kinds of tools or platforms are well suited for being provisioned and managed without involvement from the platform team?

What are some of the pitfalls that these solutions present as a result of their initial ease of use?

What are the benefits to the organization of individuals or teams building and managing their own solutions?
What are some of the risks associated with these implementations of data collection, storage, and management?

Beginning Microsoft Power BI: A Practical Guide to Self-Service Data Analytics

Analyze company data quickly and easily using Microsoft’s powerful data tools. Learn to build scalable and robust data models, clean and combine different data sources effectively, and create compelling and professional visuals. Beginning Power BI is a hands-on, activity-based guide that takes you through the process of analyzing your data using the tools that encompass the core of Microsoft’s self-service BI offering. Starting with Power Query, you will learn how to get data from a variety of sources, and see just how easy it is to clean and shape the data prior to importing it into a data model. Using Power BI tabular and Data Analysis Expressions (DAX), you will learn to create robust, scalable data models which will serve as the foundation of your data analysis. From there you will enter the world of compelling interactive visualizations to analyze and gain insight into your data. You will wrap up your Power BI journey by learning how to package and share your reports and dashboards with your colleagues. Author Dan Clark takes you through each topic using step-by-step activities and plenty of screenshots to help familiarize you with the tools. This third edition covers the new and evolving features in the Power BI platform and adds new chapters on data flows and composite models. This book is your hands-on guide to quick, reliable, and valuable data insight.

What You Will Learn:
Simplify data discovery, association, and cleansing
Build solid analytical data models
Create robust interactive data presentations
Combine analytical and geographic data in map-based visualizations
Publish and share dashboards and reports

Who This Book Is For: Business analysts, database administrators, developers, and other professionals looking to better understand and communicate with data
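
Power Query and DAX are the book's tools; purely as an analogy for the get/clean/combine workflow it describes, a short Python/pandas sketch (with invented sample data) shows the same shape of work:

```python
# An analogy (in pandas, not Power Query) for the get/clean/combine workflow
# the book teaches. The two "sources" and their columns are invented.
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "customer": [" Ann ", "bob", "Ann"],
                       "total": ["10.5", "20.0", "7.25"]})
regions = pd.DataFrame({"customer": ["ann", "bob"],
                        "region": ["East", "West"]})

# Clean: trim and normalize the join key, fix the numeric type.
orders["customer"] = orders["customer"].str.strip().str.lower()
orders["total"] = orders["total"].astype(float)

# Combine: enrich orders with each customer's region.
combined = orders.merge(regions, on="customer", how="left")
print(combined)
```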

podcast_episode
by Mico Yuk (Data Storytelling Academy), Ankush D'Souza (Estée Lauder Companies)
BI

I still can't believe what transpired at the recent BI Data Storytelling (BIDS) Accelerator Workshop—simply because we made the exception of letting in a VIP after enrollment had closed. Why? The VIP desperately expressed his need to learn and implement our BI methodology/framework. He was determined to get it done, and the results he shares are amazing.

Today's guest is Ankush D'Souza, Director of BI at Estée Lauder Companies, who is going to show you how to deliver your BI Strategy 'Gorilla' style. Ankush describes the four pillars he uses to create a successful BI Strategy, why not having a budget can be a benefit, and which tactics and tools to focus on. In this episode, you'll learn:
[08:56] New Decade, New Year: Time to think about fiscal end-of-year budgets and your BI Strategy.
[25:37] Clarification: Mico and Ankush describe the differences between BIDF and BIDS.
[28:25] Gorilla-style BI Strategy: Ankush is a big believer in sharing knowledge.
For full show notes and the links mentioned, visit: bibrainz.com/podcast/43
Sponsor: The next BI Data Storytelling Mastery Accelerator 3-Day Live workshop will be held soon. Many BI teams are still struggling to deliver consistent, highly engaging analytics their users love. At the end of three days, you'll leave with a clear BI delivery action plan. Register today!
Enjoyed the Show? Please leave us a review on iTunes.

Summary One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage multiple instances of their platform across customer environments. In this episode Josh Beemster, the technical operations lead at Snowplow, explains how they manage automation, deployment, monitoring, scaling, and maintenance of their streaming analytics pipeline for event data. He also shares the challenges they face in supporting multiple cloud environments and the need to integrate with existing customer systems. If you are daunted by the needs of your data infrastructure then it’s worth listening to how Josh and his team are approaching the problem.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Josh Beemster about how Snowplow manages deployment and maintenance of their managed service in their customers’ cloud accounts.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the components in your system architecture and the nature of your managed service?
What are some of the challenges that are inherent to the private SaaS nature of your managed service?
What elements of your system require the most attention and maintenance to keep them running properly?
Which components in the pipeline are most subject to variability in traffic or resource pressure, and what do you do to ensure proper capacity?
How do you manage deployment of the full Snowplow pipeline for your customers?

How has your strategy for deployment evolved since you first began offering the managed service?
How has the architecture of the pipeline evolved to simplify operations?

How much customization do you allow for in the event that the customer has their own system that they want to use in place of one of your supported components?

What are some of the common difficulties that you encounter when working with customers who need customized components, topologies, or event flows?

How does that reflect in the tooling that you use to manage their deployments?

What types of metrics do you track, and what do you use for monitoring and alerting to ensure that your customers’ pipelines are running smoothly?
What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with and on Snowplow?
What are some lessons that you can generalize for management of data infrastructure more broadly?
If you could start over with all of Snowplow and the infrastructure automation for it today, what would you do differently?
What do you have planned for the future of the Snowplow product and infrastructure management?

Contact Info

LinkedIn
jbeemster on GitHub
@jbeemster1 on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Snowplow Analytics

Podcast Episode

Terraform
Consul
Nomad
Meltdown Vulnerability
Spectre Vulnerability
AWS Kinesis
Elasticsearch
SnowflakeDB
Indicative
S3
Segment
AWS Cloudwatch
Stackdriver
Apache Kafka
Apache Pulsar
Google Cloud PubSub
AWS SQS
AWS SNS
AWS Redshift
Ansible
AWS Cloudformation
Kubernetes
AWS EMR

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Hey, Data Hackers! Welcome to another episode of the Data Science podcast from the biggest Data Science community in Brazil-zil-zil! In today's episode we talk about one of the data field's best friends: Statistics!

In today's episode, we invited the statisticians Luciana Lima (Head of Analytics at A3Data) and André Calaça (co-founder of Oper) to talk about how they work with Data Science, how Data Scientists can learn Statistics, how Statisticians can become Data Scientists, and whether Python really is better than R.

Understanding Log Analytics at Scale

If enabled, logging captures almost every system process, event, or message in your software or hardware. But once you have all that data, what do you do with it? This report shows you how to use log analytics, the process of gathering, correlating, and analyzing that information, to drive critical business insights and outcomes. Drawing on real-world use cases, Matt Gillespie outlines the opportunities for log analytics and the challenges you may face, along with approaches for meeting them. Data architects and IT and infrastructure leads will learn the mechanics of log analytics and key architectural considerations for data storage. The report also offers nine key guideposts that will help you plan and design your own solutions to obtain the full value from your log data.

Learn the current state of log analytics and common challenges
See how log analytics is helping organizations achieve better business outcomes in areas such as cybersecurity, IT operations, and industrial automation
Explore tools for log analytics, including Splunk, the Elastic stack, and Sumo Logic
Understand the role storage plays in ensuring successful outcomes
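
As a small illustration of the gathering-and-analyzing step, here is a Python sketch that parses syslog-style lines and tallies them by severity; the line format and sample data are assumptions for the example:

```python
# A toy log-analytics step: parse syslog-style lines and tally by severity.
# The line format and sample data are assumptions for illustration; the tools
# the report covers (Splunk, Elastic, Sumo Logic) do this at far larger scale.
import re
from collections import Counter

LINE_RE = re.compile(r"^(?P<ts>\S+ \S+) (?P<level>INFO|WARN|ERROR) (?P<msg>.*)$")

sample_logs = [
    "2020-03-01 12:00:01 INFO service started",
    "2020-03-01 12:00:05 ERROR connection refused to db:5432",
    "2020-03-01 12:00:09 WARN retrying connection",
    "2020-03-01 12:00:12 ERROR connection refused to db:5432",
]

levels = Counter()
for line in sample_logs:
    m = LINE_RE.match(line)
    if m:
        levels[m.group("level")] += 1

print(levels.most_common())  # e.g. [('ERROR', 2), ('INFO', 1), ('WARN', 1)]
```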

podcast_episode
by Val Kroll, Julie Hoyer, Tim Wilson (Analytics Power Hour - Columbus (OH)), Moe Kiss (Canva), Michael Helbling (Search Discovery)

"QA and patience and reviews by a peer. Data viz testing, hold no chart too dear. Don't be an a*e; automate 'til it stings. These are a few of our favorite things!" With apologies to Julie Andrews, on this episode, Moe, Tim, and Michael shared some of the tactical tips and techniques that they have found themselves putting to use on a regular basis in their analytics work. The resulting show: multiple tips, minimal disagreements, and moderate laughter. For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

Summary Designing the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data modeling strategy that provides them with flexibility and speed. Data Vault is an approach that allows for evolving a data model in place without requiring destructive transformations and massive up front design to answer valuable questions. In this episode Kent Graziano shares his journey with data vault, explains how it allows for an agile approach to data warehousing, and explains the core principles of how to use it. If you’re struggling with unwieldy dimensional models, slow moving projects, or challenges integrating new data sources then listen in on this conversation and then give data vault a try for yourself.
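
Data vault's core structures are hubs (business keys), links (relationships between hubs), and satellites (descriptive attributes). As a rough sketch of how they separate concerns, here they are as Python dataclasses, where a real implementation would use warehouse tables; the field names follow common data vault conventions:

```python
# A toy sketch of data vault's three core structures as Python dataclasses.
# In a real warehouse these are tables; the fields shown (business key, load
# timestamp, record source) follow standard data vault conventions.
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Hub:                    # one row per business key (e.g., a customer ID)
    business_key: str
    load_ts: datetime
    record_source: str

@dataclass(frozen=True)
class Link:                   # relates two or more hubs (e.g., customer-order)
    hub_keys: tuple
    load_ts: datetime
    record_source: str

@dataclass(frozen=True)
class Satellite:              # descriptive attributes, versioned by load time
    parent_key: str           # the hub or link this satellite describes
    attributes: dict
    load_ts: datetime
    record_source: str

# Adding a new data source later means adding new satellites (and possibly
# links) alongside the existing ones; nothing already loaded is restructured.
crm_sat = Satellite("customer-42", {"email": "a@example.com"},
                    datetime.now(), "crm")
print(crm_sat)
```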

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Kent Graziano about data vault modeling and the role that it plays in the current data landscape.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what data vault modeling is and how it differs from other approaches such as third normal form or the star/snowflake schema?

What is the history of this approach, and what limitations of alternate styles of modeling is it attempting to overcome?
How did you first encounter this approach to data modeling, and what is your motivation for dedicating so much time and energy to promoting it?

What are some of the primary challenges associated with data modeling that contribute to the long lead times for data requests or o

Principles of Managerial Statistics and Data Science

Introduces readers to the principles of managerial statistics and data science, with an emphasis on the statistical literacy of business students. Through a statistical perspective, this book introduces readers to the topic of data science, including Big Data, data analytics, and data wrangling. Chapters include multiple examples showing the application of the theoretical aspects presented. It features practice problems designed to ensure that readers understand the concepts and can apply them using real data. Over 100 open data sets used for examples and problems come from regions throughout the world, allowing the instructor to adapt the application to local data with which students can identify. Applications with these data sets include:

Assessing if searches during a police stop in San Diego are dependent on the driver’s race
Visualizing the association between fat percentage and moisture percentage in Canadian cheese
Modeling taxi fares in Chicago using data from millions of rides
Analyzing mean sales per unit of legal marijuana products in Washington state

Topics covered in Principles of Managerial Statistics and Data Science include: data visualization; descriptive measures; probability; probability distributions; mathematical expectation; confidence intervals; and hypothesis testing. Analysis of variance, simple linear regression, and multiple linear regression are also included. In addition, the book offers contingency tables, chi-square tests, non-parametric methods, and time series methods. The textbook:

Includes academic material usually covered in introductory Statistics courses, but with a data science twist and less emphasis on the theory
Relies on Minitab to present how to perform tasks with a computer
Presents and motivates the use of data that comes from open portals
Focuses on developing an intuition for how the procedures work
Exposes readers to the potential of Big Data and current failures of its use

Supplementary material includes: a companion website that houses PowerPoint slides; an Instructor's Manual with tips, a syllabus model, and project ideas; R code to reproduce examples and case studies; and information about the open portal data. The book also features an appendix with solutions to some practice problems. Principles of Managerial Statistics and Data Science is a textbook for undergraduate and graduate students taking managerial Statistics courses, and a reference book for working business professionals.
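
Since confidence intervals are among the topics listed, here is a minimal worked example in Python (standard library only; the sample numbers are invented) of a 95% confidence interval for a mean, using the normal approximation that a t-interval would refine for small samples:

```python
# A minimal worked example of a 95% confidence interval for a sample mean.
# Sample values are invented; the normal (z) approximation is used here,
# where a t-interval would be more appropriate for a sample this small.
from statistics import NormalDist, mean, stdev
from math import sqrt

sample = [23.1, 19.8, 21.4, 22.7, 20.9, 24.3, 21.8, 22.1]

m = mean(sample)
se = stdev(sample) / sqrt(len(sample))   # standard error of the mean
z = NormalDist().inv_cdf(0.975)          # ~1.96 for a 95% interval

print(f"mean = {m:.2f}, 95% CI = ({m - z * se:.2f}, {m + z * se:.2f})")
```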