talk-data.com

Topic

Cyber Security

Tags: cybersecurity, information_security, data_security, privacy

2078 tagged

Activity Trend

Peak of 297 activities per quarter over 2020-Q1 to 2026-Q1

Activities

2078 activities · Newest first

In the latest episode of "Making Data Simple," host Al Martin invites Jeff Jonas, CEO, founder, and chief scientist at Senzing Inc., to discuss use cases of AI and big data. The discussion covers Jeff's personal achievements, his remarkable recovery from quadriplegia, his completion of Ironman triathlon races around the globe, and the founding of his company, Senzing Inc. Suit up for what is truly an engaging conversation.

Show notes
00:00 - Check out our YouTube channel.
00:10 - Connect with producer Liam Seston on LinkedIn and Twitter.
00:15 - Connect with producer Steve Moore on LinkedIn and Twitter.
00:24 - Connect with host Al Martin on LinkedIn and Twitter.
01:28 - Connect with guest Jeff Jonas on LinkedIn and Twitter.
02:08 - Not sure what the difference between a triathlon and an Ironman triathlon is?
02:28 - Here's how NORA and other security software applications are being employed in Las Vegas.
13:22 - Here's an interesting article about parent/child naming conventions.
16:26 - Check out Jeff's keynote at IBM Think 2018.
18:55 - Check out these 6 other brands with the "try then buy" sales method.
23:30 - Try out Senzing for yourself at senzing.com.
27:41 - Get an inside look at what it's like to live in a hotel, full-time.
31:49 - Need to brush up on Context Computing? Jeff Jonas explains it here.
33:12 - Check out these 10 Ironman triathlon facts.

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

IBM Storage Networking SAN768C-6 Product Guide

This IBM® Redbooks® Product Guide describes the IBM Storage Networking SAN768C-6. IBM Storage Networking SAN768C-6 has the industry's highest port density for a storage area network (SAN) director and features 768 line-rate 32 gigabits per second (Gbps) or 16 Gbps Fibre Channel ports. Designed to support multiprotocol workloads, IBM Storage Networking SAN768C-6 enables SAN consolidation and collapsed-core solutions for large enterprises, which reduces the number of managed switches and leads to easy-to-manage deployments.

IBM Storage Networking SAN768C-6 supports the 48-port 32 Gbps Fibre Channel Switching Module, the 48-port 16 Gbps Fibre Channel Switching Module, the 48-port 10 Gbps FCoE Switching Module, the 24-port 40 Gbps FCoE Switching Module, and the 24/10-port SAN Extension Module. By reducing the number of front-panel ports that are used on inter-switch links (ISLs), it also offers room for future growth.

IBM Storage Networking SAN768C-6 addresses the mounting storage requirements of today's large virtualized data centers. As a director-class SAN switch, it uses the same operating system and management interface as other IBM data center switches. It brings intelligent capabilities to a high-performance, protocol-independent switch fabric, and delivers uncompromising availability, security, scalability, simplified management, and the flexibility to integrate new technologies. You can use IBM Storage Networking SAN768C-6 to transparently deploy unified fabrics with Fibre Channel and Fibre Channel over Ethernet (FCoE) connectivity to achieve low total cost of ownership (TCO).

For mission-critical enterprise storage networks that require secure, robust, cost-effective business-continuance services, the FCIP extension module is designed to deliver outstanding SAN extension performance, reducing latency for disk and tape operations with FCIP acceleration features, including FCIP write acceleration and FCIP tape write and read acceleration.

In this episode, Wayne Eckerson asks Steve Dine about the approach needed to migrate to the Cloud and the architecture required to run analytics in the Cloud. Steve Dine talks extensively about the pitfalls to avoid during Cloud migration and finishes by saying that even though security is a big issue, most organizations will have part of their architecture in the Cloud within the next two to three years. Steve Dine is a BI and enterprise data consultant and industry thought leader who has extensive experience in designing, delivering, and managing highly scalable and maintainable modern data architecture solutions.

Pro Power BI Architecture: Sharing, Security, and Deployment Options for Microsoft Power BI Solutions

Architect and deploy a Power BI solution. This book will help you understand the many available options and choose the best combination for hosting, developing, sharing, and deploying a Power BI solution within your organization. Pro Power BI Architecture provides detailed examples and explains the different methods available for sharing and securing Power BI content so that only intended recipients can see it. Commonly encountered problems you will learn to handle include content unexpectedly changing while users are in the process of creating reports and building analyses, methods of sharing analyses that don’t cover all the requirements of your business or organization, and inconsistent security models. The knowledge provided in this book will allow you to choose an architecture and deployment model that suits the needs of your organization, ensuring that you do not spend your time maintaining your solution but on using it for its intended purpose and gaining business value from mining and analyzing your organization’s data.

What You'll Learn
- Architect and administer enterprise-level Power BI solutions
- Choose the right sharing method for your Power BI solution
- Create and manage environments for development, testing, and production
- Implement row-level security in multiple ways to secure your data
- Save money by choosing the right licensing plan
- Select a suitable connection type—Live Connection, DirectQuery, or Scheduled Refresh—for your use case
- Set up a Power BI gateway to bridge between on-premises data sources and the Power BI cloud service

Who This Book Is For
Data analysts, developers, architects, and managers who want to leverage Power BI for their reporting solution

IBM Power Systems E870C and E880C Technical Overview and Introduction

This IBM® Redpaper™ publication is a comprehensive guide that covers the IBM Power® System E870C (9080-MME) and IBM Power System E880C (9080-MHE) servers that support IBM AIX®, IBM i, and Linux operating systems. The objective of this paper is to introduce the major innovative Power E870C and Power E880C offerings and their relevant functions. The new Power E870C and Power E880C servers with OpenStack-based cloud management and open source automation enable clients to accelerate the transformation of their IT infrastructure for cloud while providing tremendous flexibility during the transition. In addition, the Power E870C and Power E880C models provide clients with increased security, high availability, rapid scalability, and simplified maintenance and management, all while enabling business growth and dramatically reducing costs.

The systems management capability of the Power E870C and Power E880C servers speeds up and simplifies cloud deployment by providing fast and automated VM deployments, prebuilt image templates, and self-service capabilities, all with an intuitive interface. Enterprise servers provide the highest levels of reliability, availability, flexibility, and performance to bring you a world-class enterprise private and hybrid cloud infrastructure. Through enterprise-class security, efficient built-in virtualization that drives industry-leading workload density, and dynamic resource allocation and management, the server consistently delivers the highest levels of service across hundreds of virtual workloads on a single system.

The Power E870C and Power E880C servers include the cloud management software and services to assist with clients' move to the cloud, both private and hybrid. The following capabilities are included:
- Private cloud management with IBM Cloud PowerVC Manager, Cloud-based HMC Apps as a service, and open source cloud automation and configuration tooling for AIX
- Hybrid cloud support
- Hybrid infrastructure management tools
- Secure connection of system-of-record workloads and data to cloud-native applications
- IBM Cloud Starter Pack
- Flexible capacity on demand
- Power to Cloud Services

This paper expands the current set of IBM Power Systems™ documentation by providing a desktop reference that offers a detailed technical description of the Power E870C and Power E880C systems. This paper does not replace the latest marketing materials and configuration tools. It is intended as another source of information that, together with existing sources, can be used to enhance your knowledge of IBM server solutions.

Securing SQL Server: DBAs Defending the Database

Protect your data from attack by using SQL Server technologies to implement a defense-in-depth strategy for your database enterprise. This new edition covers threat analysis, common attacks and countermeasures, and provides an introduction to compliance that is useful for meeting regulatory requirements such as the GDPR. The multi-layered approach in this book helps ensure that a single breach does not lead to loss or compromise of confidential or business-sensitive data. Database professionals in today’s world deal increasingly with repeated data attacks against high-profile organizations and sensitive data. It is more important than ever to keep your company’s data secure. Securing SQL Server demonstrates how developers, administrators, and architects can all play their part in the protection of their company’s SQL Server enterprise. This book not only provides a comprehensive guide to implementing the security model in SQL Server, including coverage of technologies such as Always Encrypted, Dynamic Data Masking, and Row Level Security, but also looks at common forms of attack against databases, such as SQL injection and backup theft, with clear, concise examples of how to implement countermeasures against these specific scenarios. Most importantly, this book gives practical advice and engaging examples of how to defend your data, and ultimately your job, against attack and compromise.

What You'll Learn
- Perform threat analysis
- Implement access-level control and data encryption
- Ensure non-repudiation by implementing comprehensive auditing
- Use security metadata to ensure your security policies are enforced
- Mitigate the risk of credentials being stolen
- Put countermeasures in place against common forms of attack

Who This Book Is For
Database administrators who need to understand and counteract the threat of attacks against their company’s data; also useful for SQL developers and architects
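
To make the SQL injection countermeasure concrete, here is a minimal sketch in Python using pyodbc; the connection string, table, and column names are hypothetical, and the book itself works in T-SQL rather than Python.

```python
# Minimal sketch: parameterized queries as a SQL injection countermeasure.
# Assumes a reachable SQL Server instance and pyodbc installed; the DSN,
# table, and column names are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Sales;Trusted_Connection=yes;"
)
cursor = conn.cursor()

user_input = "Smith'; DROP TABLE Customers; --"  # hostile input

# Unsafe: string concatenation lets the input rewrite the query.
# cursor.execute("SELECT * FROM Customers WHERE LastName = '" + user_input + "'")

# Safe: the ? placeholder sends the value as data, never as SQL text.
cursor.execute("SELECT * FROM Customers WHERE LastName = ?", user_input)
for row in cursor.fetchall():
    print(row)
```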

Microsoft Power BI Dashboards Step by Step, First Edition

Your hands-on guide to building effective Power BI dashboards. Expand your expertise, and teach yourself how to create world-class Power BI business analysis dashboards that bring data to life for better decision-making. If you're an experienced business intelligence professional or manager, you'll get all the guidance, examples, and code you need to succeed, even if you've never used Power BI before.
- Successfully design, architect, and implement Power BI in your organization
- Take full advantage of any Microsoft Power BI platform, including Power BI Premium
- Make upfront decisions that position your Power BI project for success
- Build rich, live dashboards to monitor crucial data from across your organization
- Aggregate data and data elements from numerous internal and external data sources
- Develop dynamic visualizations, including charts, maps, and graphs
- Bring data to life with stunning interactive reports
- Ensure dashboard security and compliance
- Drive user adoption through effective training

Summary

Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a modern data platform that can serve the data needs of an entire company.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what Looker is and the problem that it is aiming to solve?

How do you define business intelligence?

How is Looker unique from other approaches to business intelligence in the enterprise?

How does it compare to open source platforms for BI?

Can you describe the technical infrastructure that supports Looker?
Given that you are connecting to the customer’s data store, how do you ensure sufficient security?
For someone who is using Looker, what does their workflow look like?

How does that change for different user roles (e.g. data engineer vs. sales management)?

What are the scaling factors for Looker, both in terms of volume of data for reporting from, and for user concurrency?
What are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?

What are the portions of the Looker architecture that you would do differently if you were to start over today?

What are some of the most interesting or unusual uses of Looker that you have seen?
What is in store for the future of Looker?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Looker · Upworthy · MoveOn.org · LookML · SQL · Business Intelligence · Data Warehouse · Linux · Hadoop · BigQuery · Snowflake · Redshift · DB2 · PostGres · ETL (Extract, Transform, Load) · ELT (Extract, Load, Transform) · Airflow · Luigi · NiFi · Data Curation Episode · Presto · Hive · Athena · DRY (Don’t Repeat Yourself) · Looker Action Hub · Salesforce · Marketo · Twilio · Netscape Navigator · Dynamic Pricing · Survival Analysis · DevOps · BigQuery ML · Snowflake Data Sharehouse

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Learning Apache Drill

Get up to speed with Apache Drill, an extensible distributed SQL query engine that reads massive datasets in many popular file formats such as Parquet, JSON, and CSV. Drill reads data in HDFS or in cloud-native storage such as S3 and works with Hive metastores along with distributed databases such as HBase, MongoDB, and relational databases. Drill works everywhere: on your laptop or in your largest cluster. In this practical book, Drill committers Charles Givre and Paul Rogers show analysts and data scientists how to query and analyze raw data using this powerful tool. Data scientists today spend about 80% of their time just gathering and cleaning data. With this book, you’ll learn how Drill helps you analyze data more effectively to drive down time to insight.
- Use Drill to clean, prepare, and summarize delimited data for further analysis
- Query file types including logfiles, Parquet, JSON, and other complex formats
- Query Hadoop, relational databases, MongoDB, and Kafka with standard SQL
- Connect to Drill programmatically using a variety of languages
- Use Drill even with challenging or ambiguous file formats
- Perform sophisticated analysis by extending Drill’s functionality with user-defined functions
- Facilitate data analysis for network security, image metadata, and machine learning
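
As a flavor of connecting to Drill programmatically (one of the points above), here is a minimal sketch that submits standard SQL over Drill's REST API from Python; the drillbit address and the sample file path are assumptions, not examples from the book.

```python
# A minimal sketch of querying Apache Drill from Python over its REST API,
# assuming a drillbit running locally on the default port 8047 and a sample
# JSON file at the path shown (both are assumptions).
import requests

query = """
    SELECT t.name, COUNT(*) AS events
    FROM dfs.`/tmp/events.json` AS t
    GROUP BY t.name
"""

resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL", "query": query},
    timeout=60,
)
resp.raise_for_status()
result = resp.json()

print(result["columns"])    # column names inferred by Drill
for row in result["rows"]:  # each row is a dict keyed by column name
    print(row)
```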

Pro SQL Server on Linux: Including Container-Based Deployment with Docker and Kubernetes

Get SQL Server up and running on the Linux operating system and containers. No database professional managing or developing SQL Server on Linux will want to be without this deep and authoritative guide by one of the most respected experts on SQL Server in the industry. Get an inside look at how SQL Server for Linux works through the eyes of an engineer on the team that made it possible. Microsoft SQL Server is one of the leading database platforms in the industry, and SQL Server 2017 offers developers and administrators the ability to run a database management system on Linux, offering proven support for enterprise-level features and without onerous licensing terms. Organizations invested in Microsoft and open source technologies are now able to run a unified database platform across all their operating system investments. Organizations are further able to take full advantage of containerization through popular platforms such as Docker and Kubernetes. Pro SQL Server on Linux walks you through installing and configuring SQL Server on the Linux platform. The author is one of the principal architects of SQL Server for Linux, and brings a corresponding depth of knowledge that no database professional or developer on Linux will want to be without. Throughout this book you will find internals of how SQL Server on Linux works, including an in-depth look at the innovative architecture. The book covers day-to-day management and troubleshooting, including diagnostics and monitoring, the use of containers to manage deployments, and the use of self-tuning and the in-memory capabilities. Also covered are performance capabilities, high availability, and disaster recovery along with security and encryption. The book covers the product-specific knowledge to bring SQL Server and its powerful features to life on the Linux platform, including coverage of containerization through Docker and Kubernetes.

What You'll Learn
- Learn about the history and internals of the unique SQL Server on Linux architecture
- Install and configure Microsoft’s flagship database product on the Linux platform
- Manage your deployments using container technology through Docker and Kubernetes
- Know the basics of building databases, the T-SQL language, and developing applications against SQL Server on Linux
- Use tools and features to diagnose, manage, and monitor SQL Server on Linux
- Scale your application by learning the performance capabilities of SQL Server
- Deliver high availability and disaster recovery to ensure business continuity
- Secure your database from attack, and protect sensitive data through encryption
- Take advantage of powerful features such as Failover Clusters, Availability Groups, In-Memory Support, and SQL Server’s Self-Tuning Engine
- Learn how to migrate your database from older releases of SQL Server and other database platforms such as Oracle and PostgreSQL
- Build and maintain schemas, and perform management tasks from both GUI and command line

Who This Book Is For
Developers and IT professionals who are new to SQL Server and wish to configure it on the Linux operating system. This book is also useful to those familiar with SQL Server on Windows who want to learn the unique aspects of managing SQL Server on the Linux platform and Docker containers. Readers should have a grasp of relational database concepts and be comfortable with the SQL language.

Pervasive Intelligence Now

This book looks at strategies to help companies become more intelligent, connected, and agile. It discusses how companies can define and measure high-impact outcomes and use analytics technology effectively to achieve them. It also looks at the technology needed to implement the analytics necessary to achieve high-impact outcomes—from both an analytics tool and a technical infrastructure perspective. Also discussed are ancillary, but critical, topics such as data security and governance that may not traditionally be a part of analytics discussions but are essential in helping companies maintain a secure environment for their analytics and access the quality data they need to gain critical insights and drive better decision-making.

Data Analytics for IT Networks: Developing Innovative Use Cases, First Edition

Use data analytics to drive innovation and value throughout your network infrastructure. Network and IT professionals capture immense amounts of data from their networks. Buried in this data are multiple opportunities to solve and avoid problems, strengthen security, and improve network performance. To achieve these goals, IT networking experts need a solid understanding of data science, and data scientists need a firm grasp of modern networking concepts. Data Analytics for IT Networks fills these knowledge gaps, allowing both groups to drive unprecedented value from telemetry, event analytics, network infrastructure metadata, and other network data sources. Drawing on his pioneering experience applying data science to large-scale Cisco networks, John Garrett introduces the specific data science methodologies and algorithms network and IT professionals need, and helps data scientists understand contemporary network technologies, applications, and data sources. After establishing this shared understanding, Garrett shows how to uncover innovative use cases that integrate data science algorithms with network data. He concludes with several hands-on, Python-based case studies reflecting how Cisco Customer Experience (CX) engineers support the company’s largest customers. These are designed to serve as templates for developing custom solutions ranging from advanced troubleshooting to service assurance.
- Understand the data analytics landscape and its opportunities in networking
- See how elements of an analytics solution come together in the practical use cases
- Explore and access network data sources, and choose the right data for your problem
- Innovate more successfully by understanding mental models and cognitive biases
- Walk through common analytics use cases from many industries, and adapt them to your environment
- Uncover new data science use cases for optimizing large networks
- Master proven algorithms, models, and methodologies for solving network problems
- Adapt use cases built with traditional statistical methods
- Use data science to improve network infrastructure analysis
- Analyze control and data planes with greater sophistication
- Fully leverage your existing Cisco tools to collect, analyze, and visualize data
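
For a sense of how simple statistical methods apply to network telemetry, here is an illustrative sketch (not taken from the book) that flags anomalous interface utilization samples with a z-score; the data and threshold are made up.

```python
# Illustrative sketch: flagging anomalous interface utilization samples with
# a simple z-score, a common first step when applying statistics to network
# telemetry. The samples and the 2-sigma threshold are made up.
from statistics import mean, stdev

# Five-minute utilization samples (%) from one router interface.
samples = [12.1, 11.8, 12.5, 13.0, 11.9, 12.2, 48.7, 12.4, 12.0, 12.3]

mu, sigma = mean(samples), stdev(samples)
for i, value in enumerate(samples):
    z = (value - mu) / sigma
    if abs(z) > 2:  # more than two standard deviations from the mean
        print(f"sample {i}: {value}% utilization looks anomalous (z={z:.1f})")
```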

IBM Spectrum Scale Security

Storage systems must provide reliable and convenient data access to all authorized users while simultaneously preventing threats coming from outside or even inside the enterprise. Security threats come in many forms, from unauthorized access to data, data tampering, denial of service, and obtaining privileged access to systems. According to the Storage Network Industry Association (SNIA), data security in the context of storage systems is responsible for safeguarding the data against theft, prevention of unauthorized disclosure of data, prevention of data tampering, and accidental corruption. This process ensures accountability, authenticity, business continuity, and regulatory compliance. Security for storage systems can be classified as follows:
- Data storage (data at rest, which includes data durability and immutability)
- Access to data
- Movement of data (data in flight)
- Management of data

IBM® Spectrum Scale is a software-defined storage system for high performance, large-scale workloads on-premises or in the cloud. IBM Spectrum™ Scale addresses all four aspects of security by securing data at rest (protecting data at rest with snapshots, backups, and immutability features) and securing data in flight (providing secure management of data, and secure access to data by using authentication and authorization across multiple supported access protocols). These protocols include POSIX, NFS, SMB, Hadoop, and Object (REST). For automated data management, it is equipped with powerful information lifecycle management (ILM) tools that can help administer unstructured data by providing the correct security for the correct data.

This IBM Redpaper™ publication details the various aspects of security in IBM Spectrum Scale™, including the following items:
- Security of data in transit
- Security of data at rest
- Authentication
- Authorization
- Hadoop security
- Immutability
- Secure administration
- Audit logging
- Security for transparent cloud tiering (TCT)
- Security for OpenStack drivers

Unless stated otherwise, the functions that are mentioned in this paper are available in IBM Spectrum Scale V4.2.1 or later releases.

Random Number Generators—Principles and Practices

Random Number Generators, Principles and Practices has been written for programmers, hardware engineers, and sophisticated hobbyists interested in understanding random number generators and gaining the tools necessary to work with random number generators with confidence and knowledge. Using an approach that employs clear diagrams and running code examples rather than excessive mathematics, random number related topics such as entropy estimation, entropy extraction, entropy sources, PRNGs, randomness testing, distribution generation, and many others are exposed and demystified. If you have ever:
- Wondered how to test if data is really random
- Needed to measure the randomness of data in real time as it is generated
- Wondered how to get randomness into your programs
- Wondered whether or not a random number generator is trustworthy
- Wanted to be able to choose between random number generator solutions
- Needed to turn uniform random data into a different distribution
- Needed to ensure the random numbers from your computer will work for your cryptographic application
- Wanted to combine more than one random number generator to increase reliability or security
- Wanted to get random numbers in a floating point format
- Needed to verify that a random number generator meets the requirements of a published standard like SP800-90 or AIS 31
- Needed to choose between an LCG, PCG, or XorShift algorithm

then this might be the book for you.
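
As one example of the distribution-generation topic listed above, here is a minimal sketch of inverse-transform sampling, a standard way to turn uniform random data into a different distribution; it illustrates the general technique rather than reproducing code from the book.

```python
# A minimal sketch of inverse-transform sampling: map a uniform draw through
# the inverse CDF of the target distribution (here, exponential).
import math
import random

def exponential_sample(rate: float) -> float:
    """Map a uniform draw u in (0, 1) through the exponential inverse CDF,
    F^-1(u) = -ln(1 - u) / rate."""
    u = random.random()
    return -math.log(1.0 - u) / rate

samples = [exponential_sample(rate=2.0) for _ in range(100_000)]
print(sum(samples) / len(samples))  # should be close to 1/rate = 0.5
```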

Malware Data Science

"Security has become a ""big data"" problem. The growth rate of malware has accelerated to tens of millions of new files per year while our networks generate an ever-larger flood of security-relevant data each day. In order to defend against these advanced attacks, you'll need to know how to think like a data scientist. In Malware Data Science, security data scientist Joshua Saxe introduces machine learning, statistics, social network analysis, and data visualization, and shows you how to apply these methods to malware detection and analysis. You'll learn how to: • Analyze malware using static analysis• Observe malware behavior using dynamic analysis• Identify adversary groups through shared code analysis• Catch 0-day vulnerabilities by building your own machine learning detector• Measure malware detector accuracy• Identify malware campaigns, trends, and relationships through data visualization Whether you're a malware analyst looking to add skills to your existing arsenal, or a data scientist interested in attack detection and threat intelligence, Malware Data Science will help you stay ahead of the curve."

Summary

With the proliferation of data sources that give a more comprehensive view of the information critical to your business, it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplifies the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.
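
As a concrete, if simplified, view of the matching step at the heart of data mastering, here is an illustrative sketch (not Tamr's algorithm) that scores whether two records refer to the same entity; the records, weights, and threshold are made up.

```python
# Illustrative sketch of entity matching for data mastering: score whether
# two records refer to the same entity by comparing normalized attributes.
# Records, attribute weights, and the match threshold are all made up.
from difflib import SequenceMatcher

erp_record = {"name": "Robert Smith", "email": "bob.smith@example.com"}
crm_record = {"name": "Bob Smith",    "email": "bob.smith@example.com"}

def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Weighted score across attributes; an exact email match counts for more.
score = (0.4 * similarity(erp_record["name"], crm_record["name"])
         + 0.6 * similarity(erp_record["email"], crm_record["email"]))

print(f"match score: {score:.2f}")
if score > 0.85:
    print("likely the same entity -> merge into the golden record")
```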

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Mark Marinelli about data mastering for modern platforms.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by establishing a definition of data mastering that we can work from?

How does the master data set get used within the overall analytical and processing systems of an organization?

What is the traditional workflow for creating a master data set?

What has changed in the current landscape of businesses and technology platforms that makes that approach impractical?
What are the steps that an organization can take to evolve toward an agile approach to data mastering?

At what scale of company or project does it make sense to start building a master data set?
What are the limitations of using ML/AI to merge data sets?
What are the limitations of a golden master data set in practice?

Are there particular formats of data or types of entities that pose a greater challenge when creating a canonical format for them?
Are there specific problem domains that are more likely to benefit from a master data set?

Once a golden master has been established, how are changes to that information handled in practice? (e.g. versioning of the data)
What storage mechanisms are typically used for managing a master data set?

Are there particular security, auditing, or access concerns that engineers should be considering when managing their golden master that go beyond the rest of their data infrastructure?
How do you manage latency issues when trying to reference the same entities from multiple disparate systems?

What have you found to be the most common stumbling blocks for a group that is implementing a master data platform?

What suggestions do you have to help prevent such a project from being derailed?

What resources do you recommend for someone looking to learn more about the theoretical and practical aspects of data mastering?

Location-Based Services Handbook

Meeting the demands of the rapid growth of wireless Internet subscribers and the development of the worldwide location-based services (LBS) market, this volume introduces and comprehensively discusses various location-based applications such as buddy finders and proximity and security services. The material is organized in three major sections: applications, technologies, and security. Written by experts from across the globe, the articles in each of the sections range from basic concepts to research-grade material and include discussions of future directions. An extensive bibliography is included with each chapter.
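
To make the proximity-service idea concrete, here is an illustrative sketch of the underlying distance check using the haversine formula; the coordinates and alert radius are made up.

```python
# Illustrative sketch of the distance check behind proximity services such
# as buddy finders: the haversine formula gives the great-circle distance
# between two coordinates. Coordinates and radius are made up.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

user = (40.7580, -73.9855)   # Times Square
buddy = (40.7484, -73.9857)  # Empire State Building
if haversine_km(*user, *buddy) <= 2.0:  # 2 km alert radius
    print("Buddy nearby!")
```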

Summary

There are myriad reasons why data should be protected, and just as many ways to enforce that protection in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anne Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.
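
For a taste of the homomorphic property Williams describes, here is a toy sketch using the open-source python-paillier (`phe`) package, which supports addition and scalar multiplication on ciphertexts; it illustrates the concept only and is not Enveil's proprietary technology.

```python
# A toy illustration of homomorphic encryption (not Enveil's technology):
# with the Paillier scheme, you can add to and scale a value while it stays
# encrypted. Requires the open-source `phe` package (python-paillier).
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

salary = public_key.encrypt(85_000)  # ciphertext leaves the trusted zone

# An untrusted party computes on the ciphertext without ever decrypting it.
adjusted = (salary + 5_000) * 2      # homomorphic add and scalar multiply

# Only the key holder can see the result.
print(private_key.decrypt(adjusted))  # 180000
```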

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Ellison Anne Williams about Enveil, a pioneering data security company protecting Data in Use.

Interview

Introduction
How did you get involved in the area of data security?
Can you start by explaining what your mission is with Enveil and how the company got started?
One of the core aspects of your platform is the principle of homomorphic encryption. Can you explain what that is and how you are using it?

What are some of the challenges associated with scaling homomorphic encryption?
What are some difficulties associated with working on encrypted data sets?

Can you describe the underlying architecture for your data platform?

How has that architecture evolved from when you first began building it?

What are some use cases that are unlocked by having a fully encrypted data platform?
For someone using the Enveil platform, what does their workflow look like?
A major reason for never decrypting data is to protect it from attackers and unauthorized access. What are some of the remaining attack vectors?
What are some aspects of the data being protected that still require additional consideration to prevent leaking information? (e.g. identifying individuals based on geographic data, or purchase patterns)
What do you have planned for the future of Enveil?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data security today?

Links

Enveil · NSA · GDPR · Intellectual Property · Zero Trust · Homomorphic Encryption · Ciphertext · Hadoop · PII (Personally Identifiable Information) · TLS (Transport Layer Security) · Spark · Elasticsearch · Side-channel attacks · Spectre and Meltdown

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

In this podcast, Don Kettl, Professor, LBJ School, the University of Texas at Austin, talks about the future of the public sector amid disruptions in data and analytics capability. Don discusses some of the biggest opportunities in the public policy space, sheds light on how future public policy officers might design organizations that grow with time, and considers the future of jobs in the public sector and how data could disrupt the space to increase its impact. This session is great for people interested in learning about public sector data and how jobs will be affected as big data evolves.

TIMELINE:
0:28 Don's journey.
5:16 Premise of "Little bites of big data policy".
7:16 Data in the government sector.
11:18 Example of good data framework in state governments.
13:49 The need for good cooperation between the private and public sectors.
17:56 Opportunities for data in the public sector.
21:37 The failure of data in the public sector.
27:54 Perspective on open data.
33:58 Future of data in the public sector.
41:42 The role of government in data businesses.
48:58 Can government data policies go global?
55:56 Don's success mantra.
59:43 Don's reading list.
1:01:30 How does Don avoid bias?
1:07:00 Key takeaways.

Don's Books:
Little Bites of Big Data for Public Policy by Donald F. Kettl: amzn.to/2zfpKDn
Politics of the Administrative Process by Donald F. Kettl: amzn.to/2KS34KY
More at: amzn.to/2u12gg8

Podcast Link: https://futureofdata.org/future-of-public-sector-and-jobs-in-bigdata-world-futureofdata-podcast/

Don's BIO: Donald F. Kettl is a professor at the Lyndon B. Johnson School of Public Affairs at the University of Texas at Austin. He is also a nonresident senior fellow at the Volcker Alliance and the Brookings Institution.

Kettl is the author or editor of numerous books, including Can Governments Earn Our Trust? (2017), Little Bites of Big Data for Public Policy (2017), and The Politics of the Administrative Process (7th edition, 2017). Three of his books have received national best-book awards: The Transformation of Governance (2002), System under Stress: Homeland Security and American Politics (2005), and Escaping Jurassic Government: How to Recover America’s Lost Commitment to Competence.

He has received three lifetime achievement awards: the American Political Science Association’s John Gaus Award, the Warner W. Stockberger Achievement Award of the International Public Management Association, and the Donald C. Stone Award of the American Society for Public Administration, for significant contributions to the field of intergovernmental relations.

Kettl holds a Ph.D. in political science from Yale University. Before his appointment at the University of Maryland, he taught at the University of Pennsylvania, Columbia University, the University of Virginia, Vanderbilt University, and the University of Wisconsin-Madison. He is a fellow of Phi Beta Kappa and the National Academy of Public Administration.

He has appeared frequently in national and international media, including National Public Radio, the Fox News Channel, Good Morning America, ABC World News Tonight, NBC Nightly News, CBS Evening News, CNN’s “Anderson Cooper 360” and “The Situation Room,” the Huffington Post, as well as public television’s News Hour and the BBC.

Kettl is a shareholder of the Green Bay Packers, along with his wife, Sue.

About the Podcast:

The FutureOfData podcast is a conversation starter that brings leaders, influencers, and leading practitioners on the show to discuss their journeys in creating the data-driven future.

Wanna join? If you or anyone you know wants to join in, register your interest @ analyticsweek.com/

Want to sponsor? Email us @ [email protected]

Keywords:

#FutureOfData #DataAnalytics #Leadership #Podcast #BigData #Strategy

Summary

The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.
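
For readers new to Airflow, here is a minimal sketch of the kind of DAG definition the episode discusses, written with Airflow 1.x-era imports (the version current when the episode aired); the task names and logic are placeholders, not James's actual pipeline.

```python
# A minimal sketch of an Airflow DAG, using Airflow 1.x-era imports.
# Task logic and names are placeholders for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    print("pull source data")

def transform():
    print("clean and aggregate")

dag = DAG(
    dag_id="example_pipeline",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,               # don't backfill past runs
)

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id="transform", python_callable=transform, dag=dag)

extract_task >> transform_task  # run extract before transform
```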

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation.

Interview

Introduction
How did you get involved in the area of data management?
What was your initial project requirement?

What tooling did you consider in addition to Airflow?
What aspects of the Airflow platform led you to choose it as your implementation target?

Can you describe your current deployment architecture?

How many engineers are involved in writing tasks for your Airflow installation?

What resources were the most helpful while learning about Airflow design patterns?

How have you architected your DAGs for deployment and extensibility?

What kinds of tests and automation have you put in place to support the ongoing stability of your deployment?
What are some of the dead-ends or other pitfalls that you encountered during the course of this project?
What aspects of Airflow have you found to be lacking that you would like to see improved?
What do you wish someone had told you before you started work on your Airflow installation?

If you were to start over, would you make the same choice?
If Airflow wasn’t available, what would be your second choice?

What are your next steps for improvements and fixes?

Contact Info

@eronarn on Twitter · Website · eronarn on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Quantopian · Harvard Brain Science Initiative · DevOps Days Boston · Google Maps API · Cron · ETL (Extract, Transform, Load) · Azkaban · Luigi · AWS Glue · Airflow · Pachyderm

Podcast Interview

AirBnB · Python · YAML · Ansible · REST (Representational State Transfer) · SAML (Security Assertion Markup Language) · RBAC (Role-Based Access Control) · Maxime Beauchemin

Medium Blog

Celery · Dask

Podcast Interview

PostgreSQL

Podcast Interview

Redis · CloudFormation · Jupyter Notebook · Qubole · Astronomer

Podcast Interview

Gunicorn · Kubernetes · Airflow Improvement Proposals · Python Enhancement Proposals (PEP)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast