talk-data.com

Topic

Cloud Computing

Tags: infrastructure, saas, iaas

4055 activities tagged

Activity Trend

[Quarterly activity chart, 2020-Q1 to 2026-Q1; peak of 471 activities per quarter]

Activities

4055 activities · Newest first

Summary Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grows. Enterprise organizations feel this acutely because of the silos that arise naturally across business units. The CluedIn team experienced this issue first-hand in their previous roles, leading them to found a company that provides a managed data fabric for the enterprise. In this episode Tim Ward, CEO of CluedIn, joins me to explain how their platform is architected, how they manage the task of integrating with third-party platforms, automating entity extraction and master data management, and the work of providing multiple views of the same data for different use cases. I highly recommend listening closely to his explanation of how they manage consistency of the data that they process across different storage backends.
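The episode carries the detail, but the shape of the consistency problem is worth sketching. Below is a minimal, hypothetical Python illustration of fanning one canonical record out to several storage backends using a deterministic idempotency key, so that replaying a failed batch converges instead of duplicating. The backend class and method names are invented for the example and are not CluedIn's actual API.

```python
import hashlib
import json


def record_key(record: dict) -> str:
    """Derive a deterministic idempotency key from the canonical record."""
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Backend:
    """Hypothetical storage backend; upserts must be idempotent on key."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def upsert(self, key, record):
        # Replaying the same key overwrites with an identical copy,
        # so retries converge rather than duplicate.
        self.store[key] = record


def fan_out(record, backends):
    key = record_key(record)
    for backend in backends:
        # In a real system each write would first land in a durable log,
        # so a crash mid-loop can be replayed until all backends agree.
        backend.upsert(key, record)


graph, search, blob = Backend("graph"), Backend("search"), Backend("blob")
fan_out({"entity": "ACME Corp", "type": "Organization"}, [graph, search, blob])
```

The deterministic key is what makes retries safe: keeping the backends consistent reduces to replaying writes until every store holds the same copy.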

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need, then it’s time to talk to our friends at strongDM. They have built an easy-to-use platform that lets you leverage your company’s single sign-on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.

Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Your host is Tobias Macey and today I’m interviewing Tim Ward about CluedIn, an integration platform for implementing your company’s data fabric

Interview

Introduction

How did you get involved in the area of data management?

This week, host Al Martin goes deep with Madhu Kochar and Hemanth Manda, two leaders of product development from the IBM Data and AI team. They discuss the future foundations of digital business -- in particular, the coming age of multicloud and how organizations will contend with data and workloads on cloud systems that span geographies, vendors, and diverse rules for governance. The conversation turns to the need for a data platform that can foster AI initiatives across these diverse environments.


Shownotes:

00:00 - Check us out on YouTube and SoundCloud.
00:10 - Connect with Producer Steve Moore on LinkedIn & Twitter.
00:15 - Connect with Producer Liam Seston on LinkedIn & Twitter.
00:20 - Connect with Producer Rachit Sharma on LinkedIn.
00:25 - Connect with Host Al Martin on LinkedIn & Twitter.
00:40 - Connect with Hemanth Manda on LinkedIn.
00:45 - Connect with Madhu Kochar on LinkedIn.

05:48 - What is Multicloud?
11:38 - Dig into ICP for Data.
18:30 - Learn more about Open API.
32:55 - Check out Stratechery by Ben Thompson.

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Mastering Ceph - Second Edition

Mastering Ceph is your comprehensive guide to understanding and deploying Ceph for scalable storage solutions. From planning and design to advanced disaster recovery practices, this book equips you with practical knowledge and hands-on techniques to harness the power of Ceph effectively.

What this book will help me do:
Design and deploy scalable Ceph clusters tailored to your needs.
Optimize Ceph's performance with state-of-the-art tuning techniques.
Implement effective disaster recovery strategies for robust storage systems.
Extend Ceph's functionality with programming using Librados.
Troubleshoot and maintain Ceph to ensure reliability and performance.

Author(s): Nick Fisk is a recognized expert in storage infrastructure. With years of hands-on experience with Ceph and storage systems, he has been involved in numerous successful deployments and performance optimizations. Drawing from real-world scenarios, the author's insights make this guide invaluable for professionals.

Who is it for? This book is tailored for storage administrators, cloud engineers, and system administrators aiming to enhance their expertise in storage technologies. Whether you're new to Ceph or looking to deepen your knowledge, the clear examples and practical advice make it a perfect pick.
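The Librados material is about talking to the cluster directly from code. As a taste of what that looks like, here is a minimal sketch using the python-rados bindings that ship with Ceph; the pool name, object name, and config path are assumptions for illustration, and a reachable cluster is required.

```python
import rados

# Connect using a cluster config file; the path is an assumption.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    # Open an I/O context on a hypothetical pool.
    ioctx = cluster.open_ioctx("mypool")
    try:
        # Write an object, then read it back.
        ioctx.write_full("hello-object", b"hello ceph")
        data = ioctx.read("hello-object")
        print(data.decode("utf-8"))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```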

Matthias Funke and Thomas Chu lead the team shaping IBM's strategy for hybrid data management. They join host Al Martin for a deep dive into trends around data management across cloud environments, from private to public to multicloud. Matthias and Thomas each started as software developers in the trenches of data and analytics, and they bring that knowledge of fundamentals to their work with large organizations in the grip of rapid change.


Shownotes

00:00 - Check us out on YouTube and SoundCloud!
00:10 - Connect with Producer Steve Moore on LinkedIn & Twitter
00:15 - Connect with Producer Liam Seston on LinkedIn & Twitter
00:20 - Connect with Producer Rachit Sharma on LinkedIn
00:25 - Connect with Host Al Martin on LinkedIn & Twitter
01:25 - Connect with Matthias Funke on LinkedIn
02:29 - Connect with Thomas Chu on LinkedIn
10:02 - Prepare your data for AI.
19:25 - What is data efficiency?
22:01 - What is cloud computing?

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Python for Data Science For Dummies, 2nd Edition

The fast and easy way to learn Python programming and statistics Python is a general-purpose programming language created in the late 1980s—and named after Monty Python—that's used by thousands of people to do things from testing microchips at Intel, to powering Instagram, to building video games with the PyGame library. Python For Data Science For Dummies is written for people who are new to data analysis, and discusses the basics of Python data analysis programming and statistics. The book also discusses Google Colab, which makes it possible to write Python code in the cloud.

Get started with data science and Python
Visualize information
Wrangle data
Learn from data

The book provides the statistical background needed to get started in data science programming, including probability, random distributions, hypothesis testing, confidence intervals, and building regression models for prediction.
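To make the statistics side concrete, here is a short example in the book's spirit: fitting a simple linear regression with statsmodels and reading off the confidence intervals and hypothesis-test p-values the book introduces. The toy data is generated purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Toy data: y depends linearly on x, plus Gaussian noise.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

X = sm.add_constant(x)             # add the intercept column
model = sm.OLS(y, X).fit()         # ordinary least squares

print(model.params)                # estimated intercept and slope
print(model.conf_int(alpha=0.05))  # 95% confidence intervals
print(model.pvalues)               # test of "coefficient equals zero"
```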

IBM DS8880 Architecture and Implementation (Release 8.51)

Abstract * Updated for R8.51 * This IBM® Redbooks® publication describes the concepts, architecture, and implementation of the IBM DS8880 family. The book provides reference information to assist readers who need to plan for, install, and configure the DS8880 systems. The IBM DS8000® family is a high-performance, high-capacity, highly secure, and resilient series of disk storage systems. The DS8880 family is the latest and most advanced of the DS8000 offerings to date. The high availability, multiplatform support, including IBM Z, and simplified management tools help provide a cost-effective path to on-demand and cloud-based infrastructures. The IBM DS8880 family now offers business-critical, all-flash, and hybrid data systems that span a wide range of price points:

DS8882F: Rack Mounted storage system
DS8884: Business Class
DS8886: Enterprise Class
DS8888: Analytics Class

The DS8884 and DS8886 are available as hybrid models or can be configured as all-flash. Each model represents the most recent in this series of high-performance, high-capacity, flexible, and resilient storage systems. These systems are intended to address the needs of the most demanding clients. Two powerful IBM POWER8® processor-based servers manage the cache to streamline disk I/O, maximizing performance and throughput. These capabilities are further enhanced with the availability of the second generation of high-performance flash enclosures (HPFEs Gen-2) and newer flash drives. Like its predecessors, the DS8880 supports advanced disaster recovery (DR) solutions, business continuity solutions, and thin provisioning. All disk drives in the DS8880 storage system include the Full Disk Encryption (FDE) feature. The DS8880 can automatically optimize the use of each storage tier, particularly flash drives, by using the IBM Easy Tier® feature. Release 8.5 introduces the Safeguarded Copy feature. The DS8882F Rack Mounted is described in a separate publication, Introducing the IBM DS8882F Rack Mounted Storage System, REDP-5505.

Summary Deep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and managing the platforms that power these models. To help us understand what is involved, we are joined this week by Thomas Henson. In this episode he shares his experiences experimenting with deep learning, what data engineers need to know about the infrastructure and data requirements to power the models that your team is building, and how it can be used to supercharge our ETL pipelines.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need, then it’s time to talk to our friends at strongDM. They have built an easy-to-use platform that lets you leverage your company’s single sign-on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th, both run by our friends at O’Reilly Media. Go to dataengineeringpodcast.com/stratacon and dataengineeringpodcast.com/aicon to register today and get 20% off

Your host is Tobias Macey and today I’m interviewing Thomas Henson about what data engineers need to know about deep learning, including how to use it for their own projects

Interview

Introduction

How did you get involved in the area of data management?

Can you start by giving an overview of what deep learning is for anyone who isn’t familiar with it?

What has been your personal experience with deep learning and what set you down that path?

What is involved in building a data pipeline and production infrastructure for a deep learning product?

How does that differ from other types of analytics projects such as data warehousing or traditional ML?

For anyone who is in the early stages of a deep learning project, what are some of the edge cases or gotchas that they should be aware of?

What are your opinions on the level of involvement/understanding that data engineers should have with the analytical products that are being built with the information we collect and curate?

What are some ways that we can use deep learning as part of the data management process?

How does that shift the infrastructure requirements for our platforms?

Cloud providers have b

Summary Distributed storage systems are the foundational layer of any big data stack. There are a variety of implementations which support different specialized use cases and come with associated tradeoffs. Alluxio is a distributed virtual filesystem which integrates with multiple persistent storage systems to provide a scalable, in-memory storage layer for scaling computational workloads independent of the size of your data. In this episode Bin Fan explains how he got involved with the project, how it is implemented, and the use cases that it is particularly well suited for. If your storage and compute layers are too tightly coupled and you want to scale them independently then Alluxio is the tool for the job.
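Although the episode is conversation rather than code, the decoupling Bin describes is visible even from application code. One common integration path is Alluxio's POSIX (FUSE) mount, which makes the unified namespace look like a local directory; assuming a hypothetical mount at /mnt/alluxio, ordinary Python file I/O is served through Alluxio's caching tiers:

```python
from pathlib import Path

# Hypothetical Alluxio FUSE mount point; the paths are assumptions.
ALLUXIO_ROOT = Path("/mnt/alluxio")
events = ALLUXIO_ROOT / "datasets" / "events" / "part-0000.json"

# The file may ultimately live in S3, HDFS, or another under-store,
# but reads hit Alluxio's memory/SSD tiers when the data is cached.
with events.open("r", encoding="utf-8") as f:
    line_count = sum(1 for _ in f)

print(f"{events} has {line_count} lines")
```

The point of the sketch is that the application code does not change whether the bytes come from S3, HDFS, or local disk, which is exactly the compute/storage decoupling discussed in the interview.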

Introduction

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Your host is Tobias Macey and today I’m interviewing Bin Fan about Alluxio, a distributed virtual filesystem for unified access to disparate data sources

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Alluxio is and the history of the project?

What are some of the use cases that Alluxio enables?

How is Alluxio implemented and how has its architecture evolved over time?

What are some of the techniques that you use to mitigate the impact of latency, particularly when interfacing with storage systems across cloud providers and private data centers?

When dealing with large volumes of data over time it is often necessary to age out older records to cheaper storage. What capabilities does Alluxio provide for that lifecycle management? What are some of the most complex or challenging aspects of providing a unified abstraction across disparate storage platforms?

What are the tradeoffs that are made to provide a single API across systems with varying capabilities?

Testing and verification of distributed systems is a complex undertaking. Can you describe the approach that you use to ensure proper functionality of Alluxio as part of the development and release process?

In order to allow for this large scale testing with any regularity it must be straightforward to deploy and configure Alluxio. What are some of the mechanisms that you have built into the platform to simplify the operational aspects?

Can you describe a typical system topology that incorporates Alluxio? For someone planning a deployment of Alluxio, what should they be considering in terms of system requirements and deployment topologies?

What are some edge cases or operational complexities that they should be aware of?

What are some cases where Alluxio is the wrong choice?

What are some projects or products that provide a similar capability to Alluxio?

What do you have planned for the future of the Alluxio project and company?

Contact Info

LinkedIn
@binfan on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Alluxio

Project
Company

Carnegie Mellon University

IBM Elastic Storage Server Implementation Guide for Version 5.3

This IBM® Redpaper™ publication introduces and describes the IBM Elastic Storage™ Server as a scalable, high-performance data and file management solution. The solution is built on proven IBM Spectrum™ Scale technology, formerly IBM General Parallel File System (GPFS™). IBM Elastic Storage Servers can be implemented for a range of diverse requirements, providing reliability, performance, and scalability. This publication helps you to understand the solution and its architecture and helps you to plan the installation and integration of the environment. The following combination of physical and logical components is required:

Hardware
Operating system
Storage
Network
Applications

This paper provides guidelines for several usage and integration scenarios. Typical scenarios include Cluster Export Services (CES) integration, disaster recovery, and multicluster integration. This paper addresses the needs of technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who must deliver cost-effective cloud services and big data solutions.

Summary Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes that require domain-specific knowledge, and how the information contained in the connections that they expose is being used for interesting projects.

Introduction

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Your host is Tobias Macey and today I’m interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data

Interview

Introduction

How did you get involved in the area of data management?

I did some database and GIS work for my dissertation in archaeology, back in the late 1990s. I got frustrated at the lack of comparative data, and I got frustrated at all the work I put into creating data that nobody would likely use. So I decided to focus my energies on research data management.

Can you start by describing what Open Context is and how it started?

Open Context is an open access data publishing service for archaeology. It started because we need better ways of disseminating structured data and digital media than is possible with conventional articles, books, and reports.

What are your protocols for determining which data sets you will work with?

Datasets need to come from research projects that meet the normal standards of professional conduct (laws, ethics, professional norms) articulated by archaeology’s professional societies.

What are some of the challenges unique to research data?

What are some of the unique requirements for processing, publishing, and archiving research data?

You have to work on a shoe-string budget, essentially providing "public goods". Archaeologists typically don’t have much discretionary money available, and publishing and archiving data are not yet very common practices.

Another issue is that it will take a long time to publish enough data to power many "meta-analyses" that draw upon many datasets. The issue is that lots of archaeological data describes very particular places and times. Because datasets can be so particularistic, finding data relevant to your interests can be hard. So, we face a monumental task in supplying enough data to satisfy many, many particularistic interests.

How much education is necessary around your content licensing for researchers who are interested in publishing their data with you?

We require use of Creative Commons licenses, and greatly encourage the CC-BY license or CC-Zero (public domain) to try to keep things simple and easy to understand.

Can you describe the system architecture that you use for Open Context?

Open Context is a Django Python application, with a Postgres database and an Apache Solr index. It runs on Google Cloud services on Debian Linux.
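For readers unfamiliar with that split, the usual pattern is that Postgres acts as the system of record while Solr serves full-text and faceted search. A minimal sketch of the search side using the pysolr client is below; the core URL and field names are invented for illustration and are not Open Context's actual schema.

```python
import pysolr

# Hypothetical Solr core; Open Context's real index will differ.
solr = pysolr.Solr("http://localhost:8983/solr/opencontext", timeout=10)

# Full-text query with a filter on a hypothetical document-type field.
results = solr.search("ceramic", fq="item_type:subjects", rows=10)

for doc in results:
    print(doc.get("uri"), doc.get("label"))
```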

Wh

The Google Analytics Suite of products is now part of the Google Marketing Platform. We will cover how key pieces of the Platform can be used, including the Salesforce connectors, Display & Video 360, the Google Optimize integration, and Google Cloud integrations. We will review how data can be used actionably for advertising, e-mail, personalization, and surveys.

Ceph: Designing and Implementing Scalable Storage Systems

Get to grips with the unified, highly scalable distributed storage system and learn how to design and implement it.

Key Features:
Explore Ceph's architecture in detail
Implement a Ceph cluster successfully and gain deep insights into its best practices
Leverage the advanced features of Ceph, including erasure coding, tiering, and BlueStore

Book Description: This Learning Path takes you through the basics of Ceph all the way to gaining in-depth understanding of its advanced features. You'll gather skills to plan, deploy, and manage your Ceph cluster. After an introduction to the Ceph architecture and its core projects, you'll be able to set up a Ceph cluster and learn how to monitor its health, improve its performance, and troubleshoot any issues. By following the step-by-step approach of this Learning Path, you'll learn how Ceph integrates with OpenStack, Glance, Manila, Swift, and Cinder. With knowledge of federated architecture and CephFS, you'll use Calamari and VSM to monitor the Ceph environment. In the upcoming chapters, you'll study the key areas of Ceph, including BlueStore, erasure coding, and cache tiering. More specifically, you'll discover what they can do for your storage system. In the concluding chapters, you will develop applications that use Librados and distributed computations with shared object classes, and see how Ceph and its supporting infrastructure can be optimized. By the end of this Learning Path, you'll have the practical knowledge of operating Ceph in a production environment. This Learning Path includes content from the following Packt products: Ceph Cookbook by Michael Hackett, Vikhyat Umrao and Karan Singh; Mastering Ceph by Nick Fisk; Learning Ceph, Second Edition by Anthony D'Atri, Vaibhav Bhembre and Karan Singh.

What you will learn:
Understand the benefits of using Ceph as a storage solution
Combine Ceph with OpenStack, Cinder, Glance, and Nova components
Set up a test cluster with Ansible and virtual machine with VirtualBox
Develop solutions with Librados and shared object classes
Configure BlueStore and see its interaction with other configurations
Tune, monitor, and recover storage systems effectively
Build an erasure-coded pool by selecting intelligent parameters

Who this book is for: If you are a developer, system administrator, storage professional, or cloud engineer who wants to understand how to deploy a Ceph cluster, this Learning Path is ideal for you. It will help you discover ways in which Ceph features can solve your data storage problems. Basic knowledge of storage systems and GNU/Linux will be beneficial.

Tableau 2019.x Cookbook

Discover the ultimate guide to Tableau 2019.x that offers over 115 practical recipes to tackle business intelligence and data analysis challenges. This book takes you from the basics to advanced techniques, empowering you to create insightful dashboards, leverage powerful analytics, and seamlessly integrate with modern cloud data platforms.

What this book will help me do:
Master both basic and advanced functionalities of Tableau Desktop to effectively analyze and visualize data.
Understand how to create impactful dashboards and compelling data stories that drive decision-making.
Deploy advanced analytical tools including R-based forecasting and statistical techniques with Tableau.
Set up and utilize Tableau Server in multi-node environments on Linux and Windows.
Utilize Tableau Prep to efficiently clean, shape, and transform data for seamless integration into Tableau workflows.

Author(s): The authors of the Tableau 2019.x Cookbook are recognized industry professionals with rich expertise in business intelligence, data analytics, and Tableau's ecosystem. Dmitry Anoshin and his co-authors bring hands-on experience from various industries to provide actionable insights. They focus on delivering practical solutions through structured learning paths.

Who is it for? This book is tailored for data analysts, BI developers, and professionals equipped with some knowledge of Tableau wanting to enhance their skills. If you're aiming to solve complex analytics challenges or want to fully utilize the capabilities of Tableau products, this book offers the guidance and knowledge you need.

Summary Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance needs and the need to scale globally on day one. In this episode Jesper Søgaard and Keld Antonsen share the story of starting and growing the big data group at LEGO. They discuss the challenges of being at global scale from the start, hiring and training talented engineers, prototyping and deploying new systems in the cloud, and what they have learned in the process. This is a useful conversation for engineers, managers, and leadership who are interested in building enterprise big data systems.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Your host is Tobias Macey and today I’m interviewing Keld Antonsen and Jesper Soegaard about the data infrastructure and analytics that powers LEGO

Interview

Introduction

How did you get involved in the area of data management?

My understanding is that the big data group at LEGO is a fairly recent development. Can you share the story of how it got started?

What kinds of data practices were in place prior to starting a dedicated group for managing the organization’s data?

What was the transition process like, migrating data silos into a uniformly managed platform?

What are the biggest data challenges that you face at LEGO?

What are some of the most critical sources and types of data that you are managing?

What are the main components of the data infrastructure that you have built to support the organization’s analytical needs?

What are some of the technologies that you have found to be most useful? Which have been the most problematic?

What does the team structure look like for the data services at LEGO?

Does that reflect in the types/numbers of systems that you support?

What types of testing, monitoring, and metrics do you use to ensure the health of the systems you support?

What have been some of the most interesting, challenging, or useful lessons that you have learned while building and maintaining the data platforms at LEGO?

How have the data systems at LEGO evolved over recent years as new technologies and techniques have been developed?

How does the global nature of the LEGO business influence the design strategies and technology choices for your platform?

What are you most excited for in the coming year?

Contact Info

Jesper

LinkedIn

Keld

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

LEGO Group
ERP (Enterprise Resource Planning)
Predictive Analytics
Prescriptive Analytics
Hadoop
Center Of Excellence
Continuous Integration
Spark

Podcast Episode

Apache NiFi

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Numerical Python: Scientific Computing and Data Science Applications with Numpy, SciPy and Matplotlib

Leverage the numerical and mathematical modules in Python and its standard library as well as popular open source numerical Python packages like NumPy, SciPy, FiPy, matplotlib and more. This fully revised edition, updated with the latest details of each package and changes to Jupyter projects, demonstrates how to numerically compute solutions and mathematically model applications in big data, cloud computing, financial engineering, business management and more. Numerical Python, Second Edition, presents many brand-new case study examples of applications in data science and statistics using Python, along with extensions to many previous examples. Each of these demonstrates the power of Python for rapid development and exploratory computing due to its simple and high-level syntax and multiple options for data analysis. After reading this book, readers will be familiar with many computing techniques including array-based and symbolic computing, visualization and numerical file I/O, equation solving, optimization, interpolation and integration, and domain-specific computational problems, such as differential equation solving, data analysis, statistical modeling and machine learning.

What You'll Learn:
Work with vectors and matrices using NumPy
Plot and visualize data with Matplotlib
Perform data analysis tasks with Pandas and SciPy
Review statistical modeling and machine learning with statsmodels and scikit-learn
Optimize Python code using Numba and Cython

Who This Book Is For: Developers who want to understand how to use Python and its related ecosystem for numerical computing.
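In the book's spirit, here is a compact example that touches three of those layers at once: NumPy for arrays, SciPy for numerically integrating a differential equation, and Matplotlib for the plot. The logistic-growth model is chosen only as a familiar illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

# Logistic growth: dy/dt = r * y * (1 - y / K)
def logistic(t, y, r=1.0, K=10.0):
    return r * y * (1 - y / K)

t_eval = np.linspace(0, 10, 200)
sol = solve_ivp(logistic, t_span=(0, 10), y0=[0.5], t_eval=t_eval)

plt.plot(sol.t, sol.y[0])
plt.xlabel("t")
plt.ylabel("y(t)")
plt.title("Logistic growth via scipy.integrate.solve_ivp")
plt.show()
```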

Microsoft Power BI Complete Reference

Design, develop, and master efficient Power BI solutions for impactful business insights.

Key Features:
Get to grips with the fundamentals of Microsoft Power BI
Combine data from multiple sources, create visuals, and publish reports across platforms
Understand Power BI concepts with real-world use cases

Book Description: Microsoft Power BI Complete Reference Guide gets you started with business intelligence by showing you how to install the Power BI toolset, design effective data models, and build basic dashboards and visualizations that make your data come to life. In this Learning Path, you will learn to create powerful interactive reports by visualizing your data, and learn visualization styles, tips, and tricks to bring your data to life. You will be able to administer your organization's Power BI environment to create and share dashboards. You will also be able to streamline deployment by implementing security and regular data refreshes. Next, you will delve deeper into the nuances of Power BI and handling projects. You will get acquainted with planning a Power BI project, development, and distribution of content, and deployment. You will learn to connect and extract data from various sources to create robust datasets, reports, and dashboards. Additionally, you will learn how to format reports and apply custom visuals, animation, and analytics to further refine your data. By the end of this Learning Path, you will have learned to implement the various Power BI tools, such as the on-premises gateway, along with staging and securely distributing content via apps. This Learning Path includes content from the following Packt products: Microsoft Power BI Quick Start Guide by Devin Knight et al.; Mastering Microsoft Power BI by Brett Powell.

What you will learn:
Connect to data sources using both import and DirectQuery options
Leverage built-in and custom visuals to design effective reports
Administer a Power BI cloud tenant for your organization
Deploy your Power BI Desktop files into the Power BI Report Server
Build efficient data retrieval and transformation processes

Who this book is for: Microsoft Power BI Complete Reference Guide is for those who want to learn and use the Power BI features to extract maximum information and make intelligent decisions that boost their business. If you have a basic understanding of BI concepts and want to learn how to apply them using Microsoft Power BI, then this Learning Path is for you. It consists of real-world examples on Power BI and goes deep into the technical issues, covers additional protocols, and much more.

Recent technology developments are driving urgency to modernize data management. What do you do about architecture, modeling, quality, and governance to keep up with big data, cloud, self-service, and other trends in data and technology? Examining some best practices can spark ideas of where to begin.

Originally published at https://www.eckerson.com/articles/stepping-up-to-modern-data-management

Paul Zikopoulos, VP of big data cognitive systems at IBM, joins us to discuss tactics for both career and personal growth. Paul is also an established author and public speaker, and leverages experiences gained through those pursuits in the advice he gives. Have a pen and paper ready as there is a lot to take away from this enlightening conversation.

Show notes
00:00 - Check us out on YouTube.
00:00 - We are now on SoundCloud.
00:10 - Add producer Liam Seston on LinkedIn and Twitter.
00:15 - Add producer Steve Moore on LinkedIn and Twitter.
00:25 - Add host Al Martin on LinkedIn and Twitter.
01:43 - Connect with Paul Zikopoulos on LinkedIn and Twitter.
07:02 - Get up to speed with Watson Studio.
10:16 - Develop a continuous learning lifestyle.
14:27 - How to figure out what you want out of a job.
20:55 - How to succeed with failure.
24:50 - "Get comfortable feeling uncomfortable."
30:54 - Here are some tips to make time for the gym.
38:28 - "Don't let other people define you."

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

2017 Data Science Salary Survey

Get a clear picture of the salaries and bonuses data science professionals around the world receive, as well as the tools and cloud providers they use, the tasks they perform, and how interpersonal ("soft") skills might affect their pay. The fifth edition of O’Reilly’s online Data Science Salary Survey provides complete results from nearly 800 participants from 69 different countries, 42 different US states, and Washington, DC. With five years of data, the survey’s results are consistent enough to reliably identify changes and trends. The survey asked specific questions about industry, team, and company size, but also posed questions such as, "How easy is it to move to another position?" or "What is your next career step?" You can plug in your own data points to the survey model and see how you compare to other data science professionals in your industry.

With this report, you’ll learn:
Where data scientists make the highest salaries—by country and by US state
Tools that respondents most commonly use on the job, and tools that contribute most to salary
Activities that contribute to higher earnings
How gender and bargaining skills affect salaries when all other factors are equal
Salary differences between those using open source tools vs those using proprietary tools
How the increase in respondents outside of the US signals a rise in international companies starting and growing data organizations

Participate in the 2018 Survey: Spend just 5 to 10 minutes and take the anonymous salary survey here: https://www.oreilly.com/ideas/take-the-data-science-salary-survey.

SQL For Dummies, 9th Edition

Get ready to make SQL easy! Updated for the latest version of SQL, the new edition of this perennial bestseller shows programmers and web developers how to use SQL to build relational databases and get valuable information from them. Covering everything you need to know to make working with SQL easier than ever, topics include how to use SQL to structure a DBMS and implement a database design, secure a database, retrieve information from a database, and much more. SQL is the international standard database language used to create, access, manipulate, maintain, and store information in relational database management systems (DBMS) such as Access, Oracle, SQL Server, and MySQL. SQL adds powerful data manipulation and retrieval capabilities to conventional languages—and this book shows you how to harness the core element of relational databases with ease.

Server platform that gives you choices of development languages, data types, on-premises or cloud, and operating systems
Find great examples on the use of temporal data
Jump right in—without previous knowledge of database programming or SQL

As database-driven websites continue to grow in popularity—and complexity—SQL For Dummies is the easy-to-understand, go-to resource you need to use it seamlessly.
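Since the book highlights temporal data, here is a small self-contained illustration using Python's built-in sqlite3 driver. The table and column names are invented for the example, and syntax for temporal features varies across the DBMSs the book covers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        ordered_at TEXT NOT NULL  -- ISO-8601 timestamps sort correctly
    )
""")
conn.executemany(
    "INSERT INTO orders (customer, ordered_at) VALUES (?, ?)",
    [("acme", "2019-01-15T09:30:00"),
     ("acme", "2019-03-02T14:00:00"),
     ("globex", "2019-02-20T11:45:00")],
)

# Range queries over time are the bread and butter of temporal SQL.
rows = conn.execute(
    "SELECT customer, ordered_at FROM orders "
    "WHERE ordered_at >= ? AND ordered_at < ? ORDER BY ordered_at",
    ("2019-02-01", "2019-04-01"),
).fetchall()

for customer, ordered_at in rows:
    print(customer, ordered_at)
```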