Cloud Computing

IBM TS4500 R5 Tape Library Guide

2018-12-06 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Michael Engelbrecht Larry Coyne Simon Browne, Illarion Borisevich, Robert Beiderbeck

ELK IBM Cyber Security data data-engineering

Abstract The IBM® TS4500 (TS4500) tape library is a next-generation tape solution that offers higher storage density and integrated management than previous solutions. This IBM Redbooks® publication gives you a close-up view of the new IBM TS4500 tape library. In the TS4500, IBM delivers the density that today’s and tomorrow’s data growth requires. It has the cost-effectiveness and the manageability to grow with business data needs, while you preserve existing investments in IBM tape library products. Now, you can achieve both a low cost per terabyte (TB) and a high TB density per square foot because the TS4500 can store up to 11 petabytes (PB) of uncompressed data in a single frame library or scale up to 2 PB per square foot to over 350 PB. The TS4500 offers the following benefits: High availability: Dual active accessors with integrated service bays reduce inactive service space by 40%. The Elastic Capacity option can be used to completely eliminate inactive service space. Flexibility to grow: The TS4500 library can grow from the right side and the left side of the first L frame because models can be placed in any active position. Increased capacity: The TS4500 can grow from a single L frame up to another 17 expansion frames with a capacity of over 23,000 cartridges. High-density (HD) generation 1 frames from the TS3500 library can be redeployed in a TS4500. Capacity on demand (CoD): CoD is supported through entry-level, intermediate, and base-capacity configurations. Advanced Library Management System (ALMS): ALMS supports dynamic storage management, which enables users to create and change logical libraries and configure any drive for any logical library. Support for IBM TS1160 while also supporting TS1155, TS1150, and TS1140 tape drive: The TS1160 gives organizations an easy way to deliver fast access to data, improve security, and provide long-term retention, all at a lower cost than disk solutions. The TS1160 offers high-performance, flexible data storage with support for data encryption. Also, this enhanced fifth-generation drive can help protect investments in tape automation by offering compatibility with existing automation. The new TS1160 Tape Drive Model 60E delivers a dual 10 Gb or 25 Gb Ethernet host attachment interface that is optimized for cloud-based and hyperscale environments. The TS1160 Tape Drive Model 60F delivers a native data rate of 400 MBps, the same load/ready, locate speeds, and access times as the TS1155, and includes dual-port 16 Gb Fibre Channel support. Support of the IBM Linear Tape-Open (LTO) Ultrium 8 tape drive: The LTO Ultrium 8 offering represents significant improvements in capacity, performance, and reliability over the previous generation, LTO Ultrium 7, while still protecting your investment in the previous technology. Support of LTO 8 Type M cartridge (M8): The LTO Program is introducing a new capability with LTO-8 drives. The ability of the LTO-8 drive to write 9 TB on a brand new LTO-7 cartridge instead of 6 TB as specified by the LTO-7 format. Such a cartridge is called an LTO-7 initialized LTO-8 Type M cartridge. Integrated TS7700 back-end Fibre Channel (FC) switches are available. Up to four library-managed encryption (LME) key paths per logical library are available. This book describes the TS4500 components, feature codes, specifications, supported tape drives, encryption, new integrated management console (IMC), and command-line interface (CLI). You learn how to accomplish the following specific tasks: Improve storage density with increased expansion frame capacity up to 2.4 times and support 33% more tape drives per frame. Manage storage by using the ALMS feature. Improve business continuity and disaster recovery with dual active accessor, automatic control path failover, and data path failover. Help ensure security and regulatory compliance with tape-drive encryption and Write Once Read Many (WORM) media. Support IBM LTO Ultrium 8, 7, 6, and 5, IBM TS1160, TS1155, TS1150, and TS1140 tape drives. Provide a flexible upgrade path for users who want to expand their tape storage as their needs grow. Reduce the storage footprint and simplify cabling with 10 U of rack space on top of the library. This guide is for anyone who wants to understand more about the IBM TS4500 tape library. It is particularly suitable for IBM clients, IBM Business Partners, IBM specialist sales representatives, and technical specialists.

Steve Dine: Modern Cloud Architecture & Mistakes To Avoid When Moving To The Cloud

2018-12-03 · Secrets of Data Analytics Leaders Listen

podcast_episode

by Steve Dine (Datasource Consulting, an EXL company) , Wayne Eckerson (Eckerson Group)

Analytics BI Cyber Security

In this episode, Wayne Eckerson asks Steve Dine about the approach needed to migrate to the Cloud and architecture required to run analytics in the Cloud. Steve Dine talks extensively about the pitfalls to avoid during Cloud migration and finishes off by saying that even though security is a big issue, most organizations will have part of their architecture in the Cloud during the next two-three years. Steve Dine is a BI and enterprise data consultant and industry thought leader who has extensive experience in designing, delivering and managing highly scalable and maintainable modern data architecture solutions.

Pro Power BI Architecture: Sharing, Security, and Deployment Options for Microsoft Power BI Solutions

2018-11-19 · O'Reilly Data Science Books O'Reilly Amazon

book

by reza rad (RADACAD)

BI Microsoft Power BI Cyber Security business-intelligence data data-science microsoft-power-platform power-bi

Architect and deploy a Power BI solution. This book will help you understand the many available options and choose the best combination for hosting, developing, sharing, and deploying a Power BI solution within your organization. Pro Power BI Architecture provides detailed examples and explains the different methods available for sharing and securing Power BI content so that only intended recipients can see it. Commonly encountered problems you will learn to handle include content unexpectedly changing while users are in the process of creating reports and building analysis, methods of sharing analyses that don’t cover all the requirements of your business or organization, and inconsistent security models. The knowledge provided in this book will allow you to choose an architecture and deployment model that suits the needs of your organization, ensuring that you do not spend your time maintaining your solution but onusing it for its intended purpose and gaining business value from mining and analyzing your organization’s data. What You'll Learn Architect and administer enterprise-level Power BI solutions Choose the right sharing method for your Power BI solution Create and manage environments for development, testing, and production Implement row level security in multiple ways to secure your data Save money by choosing the right licensing plan Select a suitable connection type—Live Connection, DirectQuery, or Scheduled Refresh—for your use case Set up a Power BI gateway to bridge between on-premises data sources and the Power BI cloud service Who This Book Is For Data analysts, developers, architects, and managers who want to leverage Power BI for their reporting solution

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

2018-11-19 · Data Engineering Podcast Listen

podcast_episode

by Fabian Hueske (Data Artisans) , Tobias Macey

Flink Data Engineering Data Management Dataflow GCP GitHub Hadoop IBM Java Kafka Scala Spark +2 more

Summary

Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the lanscape of stream processing tools, and how you can start using it today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Flink is and how the project got started? What are some of the primary ways that Flink is used? How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?

What are some use cases that Flink is uniquely qualified to handle?

Where does Flink fit into the current data landscape? How is Flink architected?

How has that architecture evolved? Are there any aspects of the current design that you would do differently if you started over today?

How does scaling work in a Flink deployment?

What are the scaling limits? What are some of the failure modes that users should be aware of?

How is the statefulness of a cluster managed?

What are the mechanisms for managing conflicts? What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose? Can state be shared across processes or tasks within a Flink cluster?

What are the comparative challenges of working with bounded vs unbounded streams of data? How do you handle out of order events in Flink, especially as the delay for a given event increases? For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it? What are some of the most challenging or complicated aspects of building and maintaining Flink? What are some of the most interesting or unexpected ways that you have seen Flink used? What are some of the improvements or new features that are planned for the future of Flink? What are some features or use cases that you are explicitly not planning to support? For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?

What do they find most interesting or exciting?

Contact Info

LinkedIn @fhueske on Twitter fhueske on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Flink Data Artisans IBM DB2 Technische Universität Berlin Hadoop Relational Database Google Cloud Dataflow Spark Cascading Java RocksDB Flink Checkpoints Flink Savepoints Kafka Pulsar Storm Scala LINQ (Language INtegrated Query) SQL Backpressure

IBM Power Systems E870C and E880C Technical Overview and Introduction

2018-11-14 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Scott Vetter , Volker Haug , Alexandre Bicas Caldeira

IBM Linux Marketing Cyber Security Virtual Machine data data-engineering

This IBM® Redpaper™ publication is a comprehensive guide that covers the IBM Power® System E870C (9080-MME) and IBM Power System E880C (9080-MHE) servers that support IBM AIX®, IBM i, and Linux operating systems. The objective of this paper is to introduce the major innovative Power E870C and Power E880C offerings and their relevant functions. The new Power E870C and Power E880C servers with OpenStack-based cloud management and open source automation enables clients to accelerate the transformation of their IT infrastructure for cloud while providing tremendous flexibility during the transition. In addition, the Power E870C and Power E880C models provide clients increased security, high availability, rapid scalability, simplified maintenance, and management, all while enabling business growth and dramatically reducing costs. The systems management capability of the Power E870C and Power E880C servers speeds up and simplifies cloud deployment by providing fast and automated VM deployments, prebuilt image templates, and self-service capabilities, all with an intuitive interface. Enterprise servers provide the highest levels of reliability, availability, flexibility, and performance to bring you a world-class enterprise private and hybrid cloud infrastructure. Through enterprise-class security, efficient built-in virtualization that drives industry-leading workload density, and dynamic resource allocation and management, the server consistently delivers the highest levels of service across hundreds of virtual workloads on a single system. The Power E870C and Power E880C server includes the cloud management software and services to assist with clients' move to the cloud, both private and hybrid. The following capabilities are included: Private cloud management with IBM Cloud PowerVC Manager, Cloud-based HMC Apps as a service, and open source cloud automation and configuration tooling for AIX Hybrid cloud support Hybrid infrastructure management tools Securely connect system of record workloads and data to cloud native applications IBM Cloud Starter Pack Flexible capacity on demand Power to Cloud Services This paper expands the current set of IBM Power Systems™ documentation by providing a desktop reference that offers a detailed technical description of the Power E870C and Power E880C systems. This paper does not replace the latest marketing materials and configuration tools. It is intended as another source of information that, together with existing sources, can be used to enhance your knowledge of IBM server solutions.

How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56

2018-11-11 · Data Engineering Podcast Listen

podcast_episode

by Tobias Macey , Yoni Iny (Upsolver)

API Avro CloudWatch Kinesis Cassandra CSV Data Engineering Data Lake Data Management Datadog DevOps DWH +13 more

Summary

A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Upsolver is and how it got started?

What are your goals for the platform?

There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?

What are the shortcomings of a data lake architecture?

How is Upsolver architected?

How has that architecture changed over time? How do you manage schema validation for incoming data? What would you do differently if you were to start over today?

What are the biggest challenges at each of the major stages of the data lake? What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake? When is Upsolver the wrong choice for an organization considering implementation of a data platform? Is there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house? What features or improvements do you have planned for the future of Upsolver?

Contact Info

Yoni

yoniiny on GitHub LinkedIn

Upsolver

Website @upsolver on Twitter LinkedIn Facebook

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Upsolver Data Lake Israeli Army Data Warehouse Data Engineering Podcast Episode About Data Curation Three Vs Kafka Spark Presto Drill Spot Instances Object Storage Cassandra Redis Latency Avro Parquet ORC Data Engineering Podcast Episode About Data Serialization Formats SSTables Run Length Encoding CSV (Comma Separated Values) Protocol Buffers Kinesis ETL DevOps Prometheus Cloudwatch DataDog InfluxDB SQL Pandas Confluent KSQL

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Learning Apache Drill

2018-11-02 · O'Reilly Data Science Books O'Reilly Amazon

book

by Paul Rogers , Charles Givre

AI/ML CSV Hadoop Apache HBase HDFS Hive JSON Kafka MongoDB Parquet RDBMS S3 +6 more

Get up to speed with Apache Drill, an extensible distributed SQL query engine that reads massive datasets in many popular file formats such as Parquet, JSON, and CSV. Drill reads data in HDFS or in cloud-native storage such as S3 and works with Hive metastores along with distributed databases such as HBase, MongoDB, and relational databases. Drill works everywhere: on your laptop or in your largest cluster. In this practical book, Drill committers Charles Givre and Paul Rogers show analysts and data scientists how to query and analyze raw data using this powerful tool. Data scientists today spend about 80% of their time just gathering and cleaning data. With this book, you’ll learn how Drill helps you analyze data more effectively to drive down time to insight. Use Drill to clean, prepare, and summarize delimited data for further analysis Query file types including logfiles, Parquet, JSON, and other complex formats Query Hadoop, relational databases, MongoDB, and Kafka with standard SQL Connect to Drill programmatically using a variety of languages Use Drill even with challenging or ambiguous file formats Perform sophisticated analysis by extending Drill’s functionality with user-defined functions Facilitate data analysis for network security, image metadata, and machine learning

EU GDPR - A Pocket Guide (European) second edition

2018-10-23 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Alan Calder

GDPR/CCPA data data-engineering data-security-privacy eu-general-data-protection-regulation-gdpr eu general data protection regulation (gdpr)

This concise guide is essential reading for EU organisations wanting an easy to follow overview of the new regulation and the compliance obligations for handling data of EU citizens. The EU General Data Protection Regulation (GDPR) will unify data protection and simplify the use of personal data across the EU, and automatically supersedes member states domestic data protection laws. It will also apply to every organisation in the world that processes personal information of EU residents. The Regulation introduces a number of key changes for all organisations that process EU residents’ personal data. EU GDPR: A Pocket Guide provides an essential introduction to this new data protection law, explaining the Regulation and setting out the compliance obligations for EU organisations. This second edition has been updated with improved guidance around related laws such as the NIS Directive and the future ePrivacy Regulation. EU GDPR – A Pocket Guide sets out: A brief history of data protection and national data protection laws in the EU (such as the German BDSG, French LIL and UK DPA). The terms and definitions used in the GDPR, including explanations. The key requirements of the GDPR, including: Which fines apply to which Articles; The six principles that should be applied to any collection and processing of personal data; The Regulation’s applicability; Data subjects’ rights; Data protection impact assessments (DPIAs); The role of the data protection officer (DPO) and whether you need one; Data breaches, and the notification of supervisory authorities and data subjects; Obligations for international data transfers. How to comply with the Regulation, including: Understanding your data, and where and how it is used (e.g. Cloud suppliers, physical records); The documentation you need to maintain (such as statements of the information you collect and process, records of data subject consent, processes for protecting personal data); The “appropriate technical and organisational measures” you need to take to ensure your compliance with the Regulation. A full index of the Regulation, enabling you to find relevant Articles quickly and easily.

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

2018-10-15 · Data Engineering Podcast Listen

podcast_episode

by Ryan Blue (Tabular) , Tobias Macey

API Avro Big Data Data Engineering Data Lake Data Management Data Modelling Git GitHub Hadoop HDFS Hive +6 more

Summary

With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly different which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Iceberg is and the motivation for creating it?

Was the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?

How has the use of Iceberg simplified your work at Netflix? How is the reference implementation architected and how has it evolved since you first began work on it?

What is involved in deploying it to a user’s environment?

For someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?

Is there a migration path for pre-existing tables into the Iceberg format?

How is schema evolution managed at the file level?

How do you handle files on disk that don’t contain all of the fields specified in a table definition?

One of the complicated problems in data modeling is managing table partitions. How does Iceberg help in that regard? What are the unique challenges posed by using S3 as the basis for a data lake?

What are the benefits that outweigh the difficulties?

What have been some of the most challenging or contentious details of the specification to define?

What are some things that you have explicitly left out of the specification?

What are your long-term goals for the Iceberg specification?

Do you anticipate the reference implementation continuing to be used and maintained?

Contact Info

rdblue on GitHub LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Iceberg Reference Implementation Iceberg Table Specification Netflix Hadoop Cloudera Avro Parquet Spark S3 HDFS Hive ORC S3mper Git Metacat Presto Pig DDL (Data Definition Language) Cost-Based Optimization

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov

2018-10-09 · Data Engineering Podcast Listen

podcast_episode

by Nikita Shamgunov (Neon) , Tobias Macey

AI/ML API BI Data Engineering Data Management Data Science DWH SQL Tableau

Summary One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data managementWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome (and it is) then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. Go to metismachine.com/webinars now to register.Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chatYour host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a newSQL database built for simultaneous transactional and analytic workloadsInterview IntroductionHow did you get involved in the area of data management?Can you start by describing what MemSQL is and how the product and business first got started?What are the typical use cases for customers running MemSQL?What are the benefits of integrating the ingestion pipeline with the database engine? What are some typical ways that the ingest capability is leveraged by customers?How is MemSQL architected and how has the internal design evolved from when you first started working on it?Where does it fall on the axes of the CAP theorem?How much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory?Can you describe the lifecycle of a write transaction?Can you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance?How do you mitigate the impact of network latency throughout the cluster during query planning and execution?How much of the implementation of MemSQL is using custom built code vs. open source projects?What are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL?What have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL?When is MemSQL the wrong choice for a data platform?What do you have planned for the future of MemSQL? Contact Info @nikitashamgunov on TwitterLinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Links MemSQLNewSQLMicrosoft SQL ServerSt. Petersburg University of Fine Mechanics And OpticsCC++In-Memory DatabaseRAM (Random Access Memory)Flash StorageOracle DBPostgreSQLPodcast EpisodeKafkaKinesisWealth ManagementData WarehouseODBCS3HDFSAvroParquetData Serialization Podcast EpisodeBroadcast JoinShuffle JoinCAP TheoremApache ArrowLZ4S2 Geospatial LibrarySybaseSAP HanaKubernetes The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

#099: How Technical Should the Analyst Be? With Simo Ahava!

2018-10-09 · The Analytics Power Hour Listen

podcast_episode

by Val Kroll , Julie Hoyer , Simo Ahava (NetBooster, Helsinki - Finland) , Tim Wilson (Analytics Power Hour - Columbus (OH) , Moe Kiss (Canva) , Michael Helbling (Search Discovery)

AWS GCP JavaScript Python SQL

Are you deeply knowledgable in JavaScript, R, the DOM, Python, AWS, jQuery, Google Cloud Platform, and SQL? Good for you! If you're not, should you be? What does "technical" mean, anyway? And, is it even possible for an analyst to dive into all of these different areas? English philosophy expert The Notorious C.M.O. (aka, Simo Ahava) returns to the show to share his thoughts on the subject in this episode. For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

Making Data Simple: Little Miss Data builds a data democracy

2018-10-09 · Making Data Simple Listen

podcast_episode

by Laura Ellis (IBM Cloud) , Al Martin (IBM)

AI/ML Analytics Dashboard Data Analytics Data Science IBM

Send us a text Making Data Simple host Al Martin has a chance to discuss all thing data with Laura Ellis, also known as Little Miss Data. Laura is an analytics architect for IBM Cloud as well as a frequent blogger. Together, they talk about how critical it is to understand your data in order create specific calls to action, and what it means to build a data democracy. Show Notes 00:00 - Follow @IBMAnalyticsSupport on Twitter. 00:22 - Check out our YouTube channel. We're posting full episodes weekly. 00:24 - Connect with Al Martin on LinkedIn and Twitter. 01:20 - Check out littlemissdata.com. 01:22 - Connect with Laura Ellis on Twitter, Instagram, and LinkedIn. 02:20 - Curious to know more about analytics architecture? Check out this IBM article on the topic. 03:52 - Check out the Little Miss Data article Al referenced here. 04:45 - Learn more about Data Democracy here in Laura's blog post. 05:31 - Understand more about the importance of data for your business in this article. 09:11 - Find out more about the challenges of being a data scientist here. 12:45 - Working with good quality data is crucial. Check out this article for more details. 16:12 - Simple data can provide the most effective returns. Learn more here. 21:15 - Choosing the right, supportive environment for your data science journey will make sure you don't get burnt out. This article examines your options. 21:35 - Data is a fundamental step when working with AI. But do you know the difference between data analytics, AI and machine learning? This Forbes article walks you through it. 22:42 - Need to brush up on what a data dashboard is? Learn more here. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Database Benchmarking and Stress Testing: An Evidence-Based Approach to Decisions on Architecture and Technology

2018-10-08 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Bert Scalzo

Docker data data-engineering database-architecture

Provide evidence-based answers that can be measured and relied upon by your business. Database administrators will be able to make sound architectural decisions in a fast-changing landscape of virtualized servers and container-based solutions based on the empirical method presented in this book for answering “what if” questions about database performance. Today’s database administrators face numerous questions such as: What if we consolidate databases using multitenant features? What if we virtualize database servers as Docker containers? What if we deploy the latest in NVMe flash disks to speed up IO access? Do features such as compression, partitioning, and in-memory OLTP earn back their price? What if we move our databases to the cloud? As an administrator, do you know the answers or even how to test the assumptions? Database Benchmarking and Stress Testing introduces you to database benchmarking using industry-standard test suites such as the TCP series of benchmarks, which are the same benchmarks that vendors rely upon. You’ll learn to run these industry-standard benchmarks and collect results to use in answering questions about the performance impact of architectural changes, technology changes, and even down to the brand of database software. You’ll learn to measure performance and predict the specific impact of changes to your environment. You’ll know the limitations of the benchmarks and the crucial difference between benchmarking and workload capture/reply. This book teaches you how to create empirical evidence in support of business and technology decisions. It’s about not guessing when you should be measuring. Empirical testing is scientific testing that delivers measurable results. Begin with a hypothesis about the impact of a possible architecture or technology change. Then run the appropriate benchmarks to gather data and predict whether the change you’re exploring will be beneficial, and by what order of magnitude. Stop guessing. Start measuring. Let Database Benchmarking and Stress Testing show the way. What You'll Learn Understand the industry-standard database benchmarks, and when each is best used Prepare for a database benchmarking effort so reliable results can be achieved Perform database benchmarking for consolidation, virtualization, and cloud projects Recognize and avoid common mistakes in benchmarking database performance Measure and interpret results in a rational, concise manner for reliable comparisons Choose and provide advice on benchmarking tools based on their pros and cons Who This Book Is For Database administrators and professionals responsible for advising on architectural decisions such as whether to use cloud-based services, whether to consolidate and containerize, and who must make recommendations on storage or any other technology that impacts database performance

IBM z14 Model ZR1 Technical Introduction

2018-10-02 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Octavian Lascu

Agile/Scrum Analytics IBM data data-engineering

Abstract This IBM® Redbooks® publication introduces the latest member of the IBM Z platform, the IBM z14 Model ZR1 (Machine Type 3907). It includes information about the Z environment and how it helps integrate data and transactions more securely, and provides insight for faster and more accurate business decisions. The z14 ZR1 is a state-of-the-art data and transaction system that delivers advanced capabilities, which are vital to any digital transformation. The z14 ZR1 is designed for enhanced modularity, which is in an industry standard footprint. This system excels at the following tasks: Securing data with pervasive encryption Transforming a transactional platform into a data powerhouse Getting more out of the platform with IT Operational Analytics Providing resilience towards zero downtime Accelerating digital transformation with agile service delivery Revolutionizing business processes Mixing open source and Z technologies This book explains how this system uses new innovations and traditional Z strengths to satisfy growing demand for cloud, analytics, and open source technologies. With the z14 ZR1 as the base, applications can run in a trusted, reliable, and secure environment that improves operations and lessens business risk.

IBM z14 Technical Introduction

2018-10-02 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Octavian Lascu

Agile/Scrum Analytics IBM data data-engineering

Abstract This IBM® Redbooks® publication introduces the latest IBM z platform, the IBM z14™. It includes information about the Z environment and how it helps integrate data and transactions more securely, and can infuse insight for faster and more accurate business decisions. The z14 is a state-of-the-art data and transaction system that delivers advanced capabilities, which are vital to the digital era and the trust economy. This system includes the following functionality: Securing data with pervasive encryption Transforming a transactional platform into a data powerhouse Getting more out of the platform with IT Operational Analytics Providing resilience with key to zero downtime Accelerating digital transformation with agile service delivery Revolutionizing business processes Blending open source and Z technologies This book explains how this system uses both new innovations and traditional Z strengths to satisfy growing demand for cloud, analytics, and mobile applications. With the z14 as the base, applications can run in a trusted, reliable, and secure environment that both improves operations and lessens business risk.

Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50

2018-10-01 · Data Engineering Podcast Listen

podcast_episode

by Chris Groskopf (Enigma) , Tobias Macey

AI/ML API Data Engineering Data Management Data Science ETL/ELT

Summary

There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how the are using public data sources to build a knowledge graph

Interview

Introduction How did you get involved in the area of data management? Can you give a brief overview of what Enigma has built and what the motivation was for starting the company?

How do you define the concept of a knowledge graph?

What are the processes involved in constructing a knowledge graph? Can you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph? What are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?

How do you manage the software lifecycle for your ETL code? What kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?

What are the current challenges that you are facing in building and scaling your data infrastructure?

How does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose? What techniques are you using to manage accuracy and consistency in the data that you ingest?

Can you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers? What are the weak spots in your platform that you are planning to address in upcoming projects?

If you were to start from scratch today, what would you have done differently?

What are some of the most interesting or unexpected uses of your product that you have seen? What is in store for the future of Enigma?

Contact Info

Email Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Enigma Chicago Tribune NPR Quartz CSVKit Aga

A Primer On Enterprise Data Curation with Todd Walter - Episode 49

2018-09-24 · Data Engineering Podcast Listen

podcast_episode

by Todd Walter , Tobias Macey

AI/ML API Big Data Data Engineering Data Lake Data Management Data Science DWH ETL/ELT

Summary As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence

Interview

Introduction How did you get involved in the area of data management? How do you define data curation?

What are some of the high level concerns that are encapsulated in that effort?

How does the size and maturity of a company affect the ways that they architect and interact with their data systems? Can you walk through the stages of an ideal lifecycle for data within the context of an organizations uses for it? What are some of the common mistakes that are made when designing a data architecture and how do they lead to failure? What has changed in terms of complexity and scope for data architecture and curation since you first started working in this space? As “big data” became more widely discussed the common mantra was to store everything because you never know when you’ll need the data that might get thrown away. As the industry is reaching a greater degree of maturity and more regulations are implemented there has been a shift to being more considerate as to what information gets stored and for how long. What are your views on that evolution and what is your litmus test for determining which data to keep? In terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?

What is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?

Once an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure? ETL has long been the default approac

IBM Spectrum Scale Security

2018-09-18 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Alifiya Kantawala , Larry Coyne , Sandeep R Patil , Felipe Knop

Data Management Hadoop IBM Cyber Security data data-engineering

Storage systems must provide reliable and convenient data access to all authorized users while simultaneously preventing threats coming from outside or even inside the enterprise. Security threats come in many forms, from unauthorized access to data, data tampering, denial of service, and obtaining privileged access to systems. According to the Storage Network Industry Association (SNIA), data security in the context of storage systems is responsible for safeguarding the data against theft, prevention of unauthorized disclosure of data, prevention of data tampering, and accidental corruption. This process ensures accountability, authenticity, business continuity, and regulatory compliance. Security for storage systems can be classified as follows: Data storage (data at rest, which includes data durability and immutability) Access to data Movement of data (data in flight) Management of data IBM® Spectrum Scale is a software-defined storage system for high performance, large-scale workloads on-premises or in the cloud. IBM Spectrum™ Scale addresses all four aspects of security by securing data at rest (protecting data at rest with snapshots, and backups and immutability features) and securing data in flight (providing secure management of data, and secure access to data by using authentication and authorization across multiple supported access protocols). These protocols include POSIX, NFS, SMB, Hadoop, and Object (REST). For automated data management, it is equipped with powerful information lifecycle management (ILM) tools that can help administer unstructured data by providing the correct security for the correct data. This IBM Redpaper™ publication details the various aspects of security in IBM Spectrum Scale™, including the following items: Security of data in transit Security of data at rest Authentication Authorization Hadoop security Immutability Secure administration Audit logging Security for transparent cloud tiering (TCT) Security for OpenStack drivers Unless stated otherwise, the functions that are mentioned in this paper are available in IBM Spectrum Scale V4.2.1 or later releases.

Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48

2018-09-17 · Data Engineering Podcast Listen

podcast_episode

by Alexander Dean (Snowplow Analytics) , Tobias Macey

AI/ML Analytics API AWS Amazon EMR Kinesis BI CRM Data Collection Data Engineering Data Management Data Science +13 more

Summary

Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics

Interview

Introductions How did you get involved in the area of data engineering and data management? What is Snowplow Analytics and what problem were you trying to solve when you started the company? What is unique about customer event data from an ingestion and processing perspective? Challenges with properly matching up data between sources Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data?

Cleanliness/accuracy

What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly? Can you describe the overall architecture of the ingest pipeline that Snowplow provides?

How has that architecture evolved from when you first started? What would you do differently if you were to start over today?

Ensuring appropriate use of enrichment sources What have been some of the biggest challenges encountered while building and evolving Snowplow? What are some of the most interesting uses of your platform that you are aware of?

Keep In Touch

Alex

@alexcrdean on Twitter LinkedIn

Snowplow

@snowplowdata on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Snowplow

GitHub

Deloitte Consulting OpenX Hadoop AWS EMR (Elastic Map-Reduce) Business Intelligence Data Warehousing Google Analytics CRM (Customer Relationship Management) S3 GDPR (General Data Protection Regulation) Kinesis Kafka Google Cloud Pub-Sub JSON-Schema Iglu IAB Bots And Spiders List Heap Analytics

Podcast Interview

Redshift SnowflakeDB Snowplow Insights Googl

Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47

2018-09-10 · Data Engineering Podcast Listen

podcast_episode

by Thomas Hazel (Chaos Search) , Pete Cheslock (ThreatStack) , Tobias Macey

AI/ML API Athena AWS Data Engineering Data Management Data Science ELK Presto S3 SQL

Summary

Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for our serverless data analysis.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $/0 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Thomas Hazel about Chaos Search and their effort to bring historical depth to your Elasticsearch data

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what you have built at Chaos Search and the problems that you are trying to solve with it?

What types of data are you focused on supporting? What are the challenges inherent to scaling an elasticsearch infrastructure to large volumes of log or metric data?

Is there any need for an Elasticsearch cluster in addition to Chaos Search? For someone who is using Chaos Search, what mechanisms/formats would they use for loading their data into S3? What are the benefits of implementing the Elasticsearch API on top of your data in S3 as opposed to using systems such as Presto or Drill to interact with the same information via SQL? Given that the S3 API has become a de facto standard for many other object storage platforms, what would be involved in running Chaos Search on data stored outside of AWS? What mechanisms do you use to allow for such drastic space savings of indexed data in S3 versus in an Elasticsearch cluster? What is the system architecture that you have built to allow for querying terabytes of data in S3?

What are the biggest contributors to query latency and what have you done to mitigate them?

What are the options for access control when running queries against the data stored in S3? What are some of the most interesting or unexpected uses of Chaos Search and access to large amounts of historical log information that you have seen? What are your plans for the future of Chaos Search?

Contact Info

Pete Cheslock

@petecheslock on Twitter Website

Thomas Hazel

@thomashazel on Twitter LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tool

talk-data.com

Activity Trend

Top Events

Top Speakers

IBM TS4500 R5 Tape Library Guide

Steve Dine: Modern Cloud Architecture & Mistakes To Avoid When Moving To The Cloud

Pro Power BI Architecture: Sharing, Security, and Deployment Options for Microsoft Power BI Solutions

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

IBM Power Systems E870C and E880C Technical Overview and Introduction

How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56

Learning Apache Drill

EU GDPR - A Pocket Guide (European) second edition

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov

#099: How Technical Should the Analyst Be? With Simo Ahava!

Making Data Simple: Little Miss Data builds a data democracy

Database Benchmarking and Stress Testing: An Evidence-Based Approach to Decisions on Architecture and Technology

IBM z14 Model ZR1 Technical Introduction

IBM z14 Technical Introduction

Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50

A Primer On Enterprise Data Curation with Todd Walter - Episode 49

IBM Spectrum Scale Security

Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48

Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47