talk-data.com

Topic

Analytics

Tags: data_analysis, insights, metrics

4552 tagged activities

Activity Trend

Peak of 398 activities per quarter, 2020-Q1 to 2026-Q1

Activities

4552 activities · Newest first

Simulating Business Processes for Descriptive, Predictive, and Prescriptive Analytics

This book outlines the benefits and limitations of simulation, what is involved in setting up a simulation capability in an organization, the steps involved in developing a simulation model, and how to ensure that model results are implemented. In addition, detailed example applications show where the tool is useful and what it can offer the decision maker. In Simulating Business Processes for Descriptive, Predictive, and Prescriptive Analytics, Andrew Greasley provides an in-depth discussion of:

- Business process simulation and how it can enable business analytics
- How business process simulation can provide speed, cost, dependability, quality, and flexibility metrics
- Industrial case studies, including improving service delivery while ensuring an efficient use of staff in public sector organizations such as the police service, testing the capacity of planned production facilities in manufacturing, and ensuring on-time delivery in logistics systems
- State-of-the-art developments in business process simulation regarding the generation of simulation analytics using process mining and modeling people's behavior

Managers and decision makers will learn how simulation provides a faster, cheaper, and less risky way of observing the future performance of a real-world system. The book will also benefit personnel already involved in simulation development by providing a business perspective on managing the process of simulation, ensuring simulation results are implemented, and improving performance.
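For a flavor of what process simulation involves, here is a minimal sketch (not from the book; the arrival and service times are assumed, illustrative values) of a single-server queue simulated in Python, answering the kind of "what will average wait look like?" question the book addresses:

    # Monte Carlo sketch of a single-server queue (assumed exponential times).
    import random

    random.seed(42)
    ARRIVAL_MEAN, SERVICE_MEAN, N = 4.0, 3.0, 10_000  # minutes; illustrative values

    clock = server_free_at = total_wait = 0.0
    for _ in range(N):
        clock += random.expovariate(1.0 / ARRIVAL_MEAN)   # next customer arrives
        start = max(clock, server_free_at)                # wait if the server is busy
        total_wait += start - clock
        server_free_at = start + random.expovariate(1.0 / SERVICE_MEAN)

    print(f"average wait: {total_wait / N:.2f} minutes")

Changing the service rate and re-running is exactly the cheap, low-risk "observe the future" experiment the book describes.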

SQL Server 2019 Revealed: Including Big Data Clusters and Machine Learning

Get up to speed on the game-changing developments in SQL Server 2019. No longer just a database engine, SQL Server 2019 is cutting edge, with support for machine learning (ML), big data analytics, Linux, containers, Kubernetes, Java, and data virtualization to Azure. This is not a book on traditional database administration for SQL Server. It focuses on all that is new for one of the most successful modernized data platforms in the industry. It is a book for data professionals who already know the fundamentals of SQL Server and want to up their game by building their skills in some of the hottest new areas in technology. SQL Server 2019 Revealed begins with a look at the project team's goal to integrate the world of big data with SQL Server into a major product release. The book then dives into the details of key new capabilities in SQL Server 2019 using a "learn by example" approach for Intelligent Performance, security, mission-critical availability, and features for the modern developer. Also covered are enhancements to SQL Server 2019 on Linux, along with a comprehensive look at running SQL Server with containers and Kubernetes clusters. The book concludes by showing you how to virtualize your data access with PolyBase to Oracle, MongoDB, Hadoop, and Azure, allowing you to reduce the need for expensive extract, transform, and load (ETL) applications. You will then learn how to take your knowledge of containers, Kubernetes, and PolyBase to build a comprehensive solution called Big Data Clusters, a marquee feature of 2019. You will also learn how to gain access to Spark, SQL Server, and HDFS to build intelligence over your own data lake and deploy end-to-end machine learning applications.

What You Will Learn:

- Implement Big Data Clusters with SQL Server, Spark, and HDFS
- Create a Data Hub with connections to Oracle, Azure, Hadoop, and other sources
- Combine SQL and Spark to build a machine learning platform for AI applications
- Boost your performance with no application changes using Intelligent Performance
- Increase security of your SQL Server through Secure Enclaves and Data Classification
- Maximize database uptime through online indexing and Accelerated Database Recovery
- Build new modern applications with Graph, ML Services, and T-SQL extensibility with Java
- Improve your ability to deploy SQL Server on Linux
- Gain in-depth knowledge to run SQL Server with containers and Kubernetes
- Know all the new database engine features for performance, usability, and diagnostics
- Use the latest tools and methods to migrate your database to SQL Server 2019
- Apply your knowledge of SQL Server 2019 to Azure

Who This Book Is For: IT professionals and developers who understand the fundamentals of SQL Server and wish to focus on learning about the new, modern capabilities of SQL Server 2019. The book is for those who want to learn about SQL Server 2019 and the new Big Data Clusters and AI feature set, support for machine learning and Java, how to run SQL Server with containers and Kubernetes, and the increased capabilities around Intelligent Performance, advanced security, and high availability.
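To make the data virtualization point concrete, here is a hedged Python sketch: once a PolyBase external table has been defined, clients query it like any local table. The connection string and the table name external_orders are assumptions for illustration, not from the book:

    # Querying a hypothetical PolyBase external table via pyodbc.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver,1433;DATABASE=sales;UID=user;PWD=secret"  # placeholder credentials
    )
    cursor = conn.cursor()
    # external_orders could be backed by Oracle, MongoDB, or HDFS; the query is unchanged,
    # which is the point of data virtualization.
    for row in cursor.execute("SELECT TOP 5 order_id, amount FROM external_orders"):
        print(row.order_id, row.amount)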

Summary Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and table definitions, including unit testing SQL scripts, defining repeatable pipelines, and adding metadata to your warehouse to improve your team’s communication. In this episode CTO and co-founder of Dataform Lewis Hemens joins the show to explain his motivation for creating the platform and company, how it works under the covers, and how you can start using it today to get your data warehouse under control.
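The unit-testing idea deserves a tiny illustration. This is not Dataform's actual syntax, just the underlying pattern such tools automate: run the transformation against fixture rows and assert on the result (SQLite is used here only to keep the sketch runnable):

    # Sketch of unit testing a SQL transformation against fixture data.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE raw_events (user_id INTEGER, amount REAL);
        INSERT INTO raw_events VALUES (1, 10.0), (1, 5.0), (2, 7.5);
    """)
    rows = conn.execute(
        "SELECT user_id, SUM(amount) AS total "
        "FROM raw_events GROUP BY user_id ORDER BY user_id"
    ).fetchall()
    assert rows == [(1, 15.0), (2, 7.5)], rows  # expected totals for the fixture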

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you've got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they've got that covered too, with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show!

This week's episode is also sponsored by Datacoral. They provide an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral's customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal of making SQL the universal data programming language. Visit Datacoral.com today to find out more.

Are you working on data, analytics, or AI using platforms such as Presto, Spark, or TensorFlow? Check out the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View. This one-day conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, and TensorFlow, and you will also hear from creators of open source projects including Alluxio, Presto, Airflow, Iceberg, and more! Use discount code PODCAST for 25% off of your ticket, and the first five people to register get free tickets! Register now, as early bird tickets are ending this week! Attendees will take away learnings, swag, a free voucher to visit the museum, and a chance to win the latest iPad Pro!

You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey, and today I'm interviewing Lewis Hemens about Dataform, a platform that helps analysts apply engineering principles to the transformations and table definitions in their data warehouse.

Today's guest is no stranger to most of you. Ryan Goodman is a BI rock star and my former AoF co-host. Currently, he's the Head of Analytics at a super cool startup that has grown into a profitable, growing financial services company.

Ryan's going to teach us how to create an analytics playbook using the BI Dashboard Formula (BIDF) methodology. An analytics strategy playbook encapsulates and executes all the elements of people, technology, and process. Stay tuned to get Ryan's playbook template!

In this episode, you'll learn:

- [06:40] BI Whisperer: bridging the gap between technology and business analysts.
- [08:20] Disconnecting from being a business owner back into driving customer analytics.
- [09:16] Key Quote: "Everyone is data driven..." What that means, and what it should mean. - Ryan Goodman

For full show notes, his book giveaway, and the links mentioned, visit: https://bibrainz.com/podcast/35

Sponsor: This exciting season of AOF is sponsored by our BI Data Storytelling Mastery Accelerator 3-day live workshop. Our second one is coming up on Jan 28-30, and registration is open! Join us and consider upgrading to be a VIP (we have tons of bonuses planned). Many BI teams are still struggling to deliver consistent, highly engaging analytics their users love. At the end of three days, you'll leave with the tools, techniques, and resources you need to engage your users. Register today!

Enjoyed the show? Please leave us a review on iTunes.

IBM z15 Technical Introduction

This IBM® Redbooks® publication introduces the latest member of the IBM Z® platform, the IBM z15™ (machine type 8561). It includes information about the Z environment and how it helps integrate data and transactions more securely. It also provides insight for faster and more accurate business decisions. The z15 is a state-of-the-art data and transaction system that delivers advanced capabilities, which are vital to any digital transformation. The z15 is designed for enhanced modularity in an industry-standard footprint. The z15 system excels at the following tasks:

- Using multicloud integration services
- Securing data with pervasive encryption
- Providing resilience as the key to zero downtime
- Transforming a transactional platform into a data powerhouse
- Getting more out of the platform with IT Operational Analytics
- Accelerating digital transformation with agile service delivery
- Revolutionizing business processes
- Blending open source and Z technologies

This book explains how this system uses new innovations and traditional Z strengths to satisfy growing demand for cloud, analytics, and open source technologies. With the z15 as the base, applications can run in a trusted, reliable, and secure environment that improves operations and lessens business risk.

Summary The process of exposing your data through a SQL interface has many possible pathways, each with their own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL analytics on semi-structured and structured data. In this episode CEO Venkat Venkataramani and SVP of Product Shruti Bhat explain the origins of Rockset, how it is architected to allow for fast and flexible SQL analytics on your data, and how their serverless platform can save you the time and effort of implementing portions of your own infrastructure.
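To show what "SQL analytics on semi-structured data" means in practice, here is a stand-in sketch. This is not Rockset's API; SQLite's JSON1 functions are used so the example runs anywhere, but the shape of the query, aggregating over fields nested inside raw JSON documents with no upfront schema, is the idea:

    # SQL over semi-structured JSON documents (SQLite JSON1 as a stand-in).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (doc TEXT)")
    conn.executemany("INSERT INTO events VALUES (?)", [
        ('{"user": {"id": 1, "country": "US"}, "value": 3}',),
        ('{"user": {"id": 2, "country": "DE"}, "value": 5}',),
        ('{"user": {"id": 3, "country": "US"}, "value": 4}',),
    ])
    print(conn.execute(
        "SELECT json_extract(doc, '$.user.country') AS country, "
        "SUM(json_extract(doc, '$.value')) FROM events GROUP BY country"
    ).fetchall())  # e.g. [('DE', 5), ('US', 7)]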

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you've got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they've got that covered too, with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show!

This week's episode is also sponsored by Datacoral. They provide an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral's customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal of making SQL the universal data programming language. Visit Datacoral.com today to find out more.

You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey, and today I'm interviewing Shruti Bhat and Venkat Venkataramani about Rockset, a serverless platform for enabling fast SQL queries across all of your data.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what Rockset is and your motivation for creating it?

What are some of the use cases that it enables which would otherwise be impractical or intractable?

How does Rockset fit into the infrastructure and workflow of data teams, and what portions of a typical stack does it replace?
Can you describe how the Rockset platform is architected and how it has evolved as you onboard more customers?
Can you describe the flow of a piece of data as it traverses the full lifecycle in Rockset?
How is your storage backend implemented to allow for speed and flexibility in the query layer?

How does it manage distribution, balancing, and durability of the data?
What are your strategies for handling node and region failure in the cloud?

You have a whitepaper describing your architecture...

podcast_episode
by Alberto Cairo, Mico Yuk (Data Storytelling Academy)
BI

Guess what? I missed you! After focusing on making our first BI Data Storytelling Accelerator - 3-day live workshop a major success, I'm back! Registration is now open for the next workshop on January 28-30, 2020. It's the hottest thing since fried rice!

Today, I talk to the amazing Alberto Cairo, author of How Charts Lie. He's an AoF alum who was on my very first AoF podcast episode, where we were the first to announce The Functional Art (one of the most well-known books in our industry). Stay tuned to the end for an insane book giveaway you won't find anywhere else!

In this episode, you'll learn:

- [12:00] Why Alberto warns: "Be aware of what some people call the Curse of Knowledge."
- [19:11] Key Quote: "A visualization can only be worth a thousand words if we know how to read it." - Alberto Cairo
- [21:41] Key ways to overcome biases: steps to read a visual. Read the title, message, and patterns.

For full show notes, his book giveaway, and the links mentioned, visit: https://bibrainz.com/podcast/34

Sponsor: This exciting season of AOF is sponsored by our BI Data Storytelling Mastery Accelerator 3-day live workshop. Our second one is coming up on Jan 28-30, and registration is open! Join us and consider upgrading to be a VIP (we have tons of bonuses planned). Many BI teams are still struggling to deliver consistent, highly engaging analytics their users love. At the end of three days, you'll leave with the tools, techniques, and resources you need to engage your users. Register today!

Enjoyed the show? Please leave us a review on iTunes.

Companies that excel at advanced analytics and data science maximize the value of their data. They unearth hidden opportunities and become innovators in the industry. Although each organization has different goals, the underlying processes and tools to become successful at analytics remain somewhat the same. In this episode, Alan Jacobson explains them one by one and finishes off with his top three recommendations.

Alan Jacobson is the chief data and analytics officer (CDAO) of Alteryx, driving key data initiatives and accelerating digital business transformation for the Alteryx global customer base. As CDAO, Jacobson leads the company's data science practice as a best-in-class example of how a company can get maximum leverage out of its data and the insights it contains, and is responsible for data management and governance, product and internal data, and use of the Alteryx Platform to drive continued growth.

Alan was recognized as a top leader in the global automotive industry as an Automotive Hall of Fame Leadership & Excellence award winner and an Outstanding Engineer of the Year by the Engineering Society of Detroit, and works with the National Academy of Engineering and other organizations as an advisor on data science topics.

IBM Spectrum Discover: Metadata Management for Deep Insight of Unstructured Storage

This IBM® Redpaper publication provides a comprehensive overview of the IBM Spectrum® Discover metadata management software platform. We give a detailed explanation of how the product creates, collects, and analyzes metadata. Several in-depth use cases show examples of analytics, governance, and optimization. We also provide step-by-step information to install and set up the IBM Spectrum Discover trial environment. More than 80% of all data that is collected by organizations is not in a standard relational database. Instead, it is trapped in unstructured documents, social media posts, machine logs, and so on. Many organizations face significant challenges in managing this deluge of unstructured data, such as:

- Pinpointing and activating relevant data for large-scale analytics
- Lacking the fine-grained visibility that is needed to map data to business priorities
- Removing redundant, obsolete, and trivial (ROT) data
- Identifying and classifying sensitive data

IBM Spectrum Discover is modern metadata management software that provides data insight for petabyte-scale file and object storage, both on premises and in the cloud. This software enables organizations to make better business decisions and gain and maintain a competitive advantage. IBM Spectrum Discover provides a rich metadata layer that enables storage administrators, data stewards, and data scientists to efficiently manage, classify, and gain insights from massive amounts of unstructured data. It improves storage economics, helps mitigate risk, and accelerates large-scale analytics to create competitive advantage and speed critical research.

IBM Power Systems Enterprise AI Solutions

This IBM® Redpaper publication helps the line of business (LOB), data science, and information technology (IT) teams develop an information architecture (IA) for their enterprise artificial intelligence (AI) environment. It describes the challenges that are faced by the three roles when creating and deploying enterprise AI solutions, and how they can collaborate for best results. This publication also highlights the capabilities of the IBM Cognitive Systems and AI solutions:

- IBM Watson® Machine Learning Community Edition
- IBM Watson Machine Learning Accelerator (WMLA)
- IBM PowerAI Vision
- IBM Watson Machine Learning
- IBM Watson Studio Local
- IBM Video Analytics
- H2O Driverless AI
- IBM Spectrum® Scale
- IBM Spectrum Discover

This publication examines the challenges through five different use case examples:

- Artificial vision
- Natural language processing (NLP)
- Planning for the future
- Machine learning (ML)
- AI teaming and collaboration

This publication targets readers from LOBs, data science teams, and IT departments, and anyone that is interested in understanding how to build an IA to support enterprise AI development and deployment.

Have you ever noticed that 68.2% of the people who explain machine learning use a "this picture is a cat" example, and another 24.3% use "this picture is a dog"? Is there really a place for machine learning and the world of computer vision (or machine vision, which we have conclusively determined is a synonym) in the real world of digital analytics? The short answer is the go-to answer of every analyst: it depends. On this episode, we sat down with Ali Vanderveld, Director of Data Science at ShopRunner, to chat about some real-world applications of computer vision, as well as the many facets and considerations therein! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.
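For readers who want the obligatory cat-or-dog example in runnable form, here is a minimal sketch using a pretrained classifier (torchvision and Pillow assumed installed; pet.jpg is a hypothetical local image, and the weights download on first run):

    # Classify one image with a pretrained ResNet-18 from torchvision.
    import torch
    from torchvision import models
    from PIL import Image

    weights = models.ResNet18_Weights.DEFAULT
    model = models.resnet18(weights=weights).eval()
    img = Image.open("pet.jpg").convert("RGB")        # hypothetical file
    batch = weights.transforms()(img).unsqueeze(0)    # preprocess, add batch dim
    with torch.no_grad():
        class_id = model(batch).argmax(dim=1).item()
    print(weights.meta["categories"][class_id])       # e.g. "tabby" or "golden retriever"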

SAS for R Users

Bridges the gap between SAS and R, allowing users trained in one language to easily learn the other. SAS and R are widely used, very different software environments. Prized for its statistical and graphical tools, R is an open-source programming language that is popular with statisticians and data miners who develop statistical software and analyze data. SAS (Statistical Analysis System) is the leading corporate software in analytics thanks to its faster data handling and smaller learning curve. SAS for R Users enables entry-level data scientists to take advantage of the best aspects of both tools by providing a cross-functional framework for users who already know R but may need to work with SAS. Those with knowledge of both R and SAS are of far greater value to employers, particularly in corporate settings. Using a clear, step-by-step approach, this book presents an analytics workflow that mirrors that of the everyday data scientist. This up-to-date guide is compatible with the latest R packages as well as SAS University Edition. Useful for anyone seeking employment in data science, this book:

- Instructs both practitioners and students fluent in one language seeking to learn the other
- Provides command-by-command translations of R to SAS and SAS to R
- Offers examples and applications in both R and SAS
- Presents step-by-step guidance on workflows, color illustrations, sample code, chapter quizzes, and more
- Includes sections on advanced methods and applications

Designed for professionals, researchers, and students, SAS for R Users is a valuable resource for those with some knowledge of coding and basic statistics who wish to enter the realm of data science and business analytics.

AJAY OHRI is the founder of analytics startup Decisionstats.com. His research interests include spreading open source analytics, analyzing social media manipulation with mechanism design, simpler interfaces to cloud computing, investigating climate change, and knowledge flows. He currently advises startups in analytics offshoring, analytics services, and analytics. He is the author of Python for R Users: A Data Science Approach (Wiley), R for Business Analytics, and R for Cloud Computing.

Summary Object storage is quickly becoming the unifying layer for data intensive applications and analytics. Modern, cloud oriented data warehouses and data lakes both rely on the durability and ease of use that it provides. S3 from Amazon has quickly become the de-facto API for interacting with this service, so the team at MinIO have built a production grade, easy to manage storage engine that replicates that interface. In this episode Anand Babu Periasamy shares the origin story for the MinIO platform, the myriad use cases that it supports, and the challenges that they have faced in replicating the functionality of S3. He also explains the technical implementation, innovative design, and broad vision for the project.
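The S3 compatibility is easy to see from the client side: standard boto3 code works against a MinIO endpoint. A minimal sketch, assuming a local server with the common development defaults (the endpoint and credentials here are assumptions, not production advice):

    # Plain boto3 pointed at a MinIO endpoint instead of AWS.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",   # assumed local MinIO
        aws_access_key_id="minioadmin",         # default dev credentials
        aws_secret_access_key="minioadmin",
    )
    s3.create_bucket(Bucket="analytics")
    s3.put_object(Bucket="analytics", Key="events/day1.json", Body=b'{"ok": true}')
    listing = s3.list_objects_v2(Bucket="analytics")
    print([obj["Key"] for obj in listing.get("Contents", [])])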

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you've got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they've got that covered too, with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show!

You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey, and today I'm interviewing Anand Babu Periasamy about MinIO, the neutral, open source, enterprise-grade object storage system.

Interview

Introduction
How did you get involved in the area of data management?
Can you explain what MinIO is and its origin story?
What are some of the main use cases that MinIO enables?
How does MinIO compare to other object storage options, and what benefits does it provide over other open source platforms?

Your marketing focuses on the utility of MinIO for ML and AI workloads. What benefits does object storage provide as compared to distributed file systems (e.g., HDFS, GlusterFS, Ceph)?

What are some of the challenges that you face in terms of maintaining compatibility with the S3 interface?

What are the constraints and opportunities that are provided by adhering to that API?

Can you describe how MinIO is implemented and the overall system design?

How has that design evolved since you first began working on it?

What assumptions did you have at the outset and how have they been challenged or updated?

What are the axes for scaling that MinIO provides and how does it handle clustering?

Where does it fall on the axes of availability and consistency in the CAP theorem?

One of the useful features that you provide is efficient erasure coding, as well as protection against data corruption. How much overhead do those capabilities incur, in terms of computational efficiency and, in a clustered scenario, storage volume? (A toy sketch of the erasure-coding idea follows these questions.)
For someone who is interested in running MinIO, what is involved in deploying and maintaining it?
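As promised above, a toy illustration of the erasure-coding idea: XOR parity, the simplest two-data-blocks-plus-one-parity case. MinIO itself uses Reed-Solomon coding, which generalizes this, but the storage-overhead intuition is the same:

    # Lose either data block and rebuild it from the other block plus parity.
    a, b = b"data-block-A", b"data-block-B"          # equal-length blocks
    parity = bytes(x ^ y for x, y in zip(a, b))      # the overhead: one extra block
    rebuilt_a = bytes(p ^ y for p, y in zip(parity, b))
    assert rebuilt_a == a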

What does a data engineer do at a company in the real-estate market? How do you merge two giants of the market without the company breaking in the process? How did they manage to use a fully open-source stack? That's what you'll find out in this awesome episode, where we invited Talita Barcelos (Data & Analytics Manager), André Ronquetti (Senior Engineering Manager), and Rafael Paixão (Data Engineer) for a chat about how the Data Engineering team works at Grupo Zap.

Check out our post on Medium for the episode's links and news: https://medium.com/data-hackers/engenharia-de-dados-no-grupo-zap-data-hackers-podcast-15-8df2f974844b

Summary The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately this strategy is not viable for handling real-time, real-world use cases such as traffic management or supply chain logistics. In this episode Simon Crosby, CTO of Swim Inc., explains how the SwimOS kernel and the enterprise data fabric built on top of it enable brand new use cases for instant insights. This was an eye opening conversation about how stateful computation of data streams from edge devices can reduce cost and complexity as compared to batch oriented workflows.
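The core idea of stateful stream computation is simple to sketch. This is assumed illustrative Python, not SwimOS code: keep a small piece of state per real-world entity and update it as each event arrives, so the answer is always current rather than recomputed in batches:

    # Per-entity running statistics updated event by event.
    from collections import defaultdict

    state = defaultdict(lambda: {"count": 0, "total": 0.0})  # one entry per sensor

    def on_event(sensor_id: str, value: float) -> float:
        s = state[sensor_id]
        s["count"] += 1
        s["total"] += value
        return s["total"] / s["count"]  # running mean, current as of this event

    for sid, v in [("light-42", 3.0), ("light-42", 5.0), ("light-7", 1.0)]:
        print(sid, on_event(sid, v))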

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you've got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they've got that covered too, with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show!

Listen, I'm sure you work for a "data driven" company – who doesn't these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you've got to talk to the folks over at intermix.io. They have built the "missing" Amazon Redshift console – it's an amazing analytics product for data engineers to find and re-write slow queries, and it gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!

You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey, and today I'm interviewing Simon Crosby about Swim.ai, a data fabric for the distributed enterprise.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by explaining what Swim.ai is and how the project and business got started?

Can you explain the differentiating factors between the SwimOS and Data Fabric platforms that you offer?

What are some of the use cases that are enabled by the Swim platform that would otherwise be impractical or intractable?
How does Swim help alleviate the challenges of working with sensor-oriented applications or edge computing platforms?
Can you describe a typical design for an application or system being built on top of the Swim platform?

What does the developer workflow look like?

What kind of tooling do you have for diagnosing and debugging errors in an application built on top of Swim?

Can you describe the internal design for the SwimOS and how...

Model Management and Analytics for Large Scale Systems

Model Management and Analytics for Large Scale Systems covers the use of models and related artefacts (such as metamodels and model transformations) as central elements for tackling the complexity of building systems and managing data. With their increased use across diverse settings, the complexity, size, multiplicity, and variety of those artefacts have increased. Originally developed for software engineering, these approaches can now be used to simplify the analytics of large-scale models and automate complex data analysis processes. Those in the field of data science will gain novel insights on the topic of model analytics that go beyond both model-based development and data analytics. This book is aimed at both researchers and practitioners who are interested in model-based development and the analytics of large-scale models, ranging from big data management and analytics to enterprise domains. The book could also be used in graduate courses on model development, data analytics, and data management.

- Identifies key problems and offers solution approaches and tools that have been developed or are necessary for model management and analytics
- Explores basic theory and background, current research topics, related challenges, and the research directions for model management and analytics
- Provides a complete overview of model management and analytics frameworks, the different types of analytics (descriptive, diagnostic, predictive, and prescriptive), the required modelling and method steps, and important future directions

IBM Reference Architecture for High Performance Data and AI in Healthcare and Life Sciences

This IBM® Redpaper publication provides an update to the original description of IBM Reference Architecture for Genomics. This paper expands the reference architecture to cover all of the major vertical areas of healthcare and life sciences industries, such as genomics, imaging, and clinical and translational research. The architecture was renamed IBM Reference Architecture for High Performance Data and AI in Healthcare and Life Sciences to reflect the fact that it incorporates key building blocks for high-performance computing (HPC) and software-defined storage, and that it supports an expanding infrastructure of leading industry partners, platforms, and frameworks. The reference architecture defines a highly flexible, scalable, and cost-effective platform for accessing, managing, storing, sharing, integrating, and analyzing big data, which can be deployed on-premises, in the cloud, or as a hybrid of the two. IT organizations can use the reference architecture as a high-level guide for overcoming data management challenges and processing bottlenecks that are frequently encountered in personalized healthcare initiatives, and in compute-intensive and data-intensive biomedical workloads. This reference architecture also provides a framework and context for modern healthcare and life sciences institutions to adopt cutting-edge technologies, such as cognitive life sciences solutions, machine learning and deep learning, Spark for analytics, and cloud computing. To illustrate these points, this paper includes case studies describing how clients and IBM Business Partners alike used the reference architecture in the deployments of demanding infrastructures for precision medicine. This publication targets technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for providing life sciences solutions and support.

Learn PySpark: Build Python-based Machine Learning and Deep Learning Models

Leverage machine and deep learning models to build applications on real-time data using PySpark. This book is perfect for those who want to learn to use this language to perform exploratory data analysis and solve an array of business challenges. You'll start by reviewing PySpark fundamentals, such as Spark's core architecture, and see how to use PySpark for big data processing tasks like data ingestion, cleaning, and transformation techniques. This is followed by building workflows for analyzing streaming data using PySpark and a comparison of various streaming platforms. You'll then see how to schedule different Spark jobs using Airflow with PySpark, and the book examines tuning machine and deep learning models for real-time predictions. The book concludes with a discussion of graph frames and performing network analysis using graph algorithms in PySpark. All the code presented in the book is available in Python scripts on GitHub.

What You'll Learn:

- Develop pipelines for streaming data processing using PySpark
- Build machine learning and deep learning models using PySpark's latest offerings
- Use graph analytics with PySpark
- Create sequence embeddings from text data

Who This Book Is For: Data scientists and machine learning and deep learning engineers who want to learn and use PySpark for real-time analysis of streaming data.
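A minimal taste of the ingest-clean-aggregate flow the book walks through (pyspark assumed installed; the data and column names are illustrative):

    # Tiny PySpark pipeline: build a DataFrame, drop nulls, aggregate.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("learn-pyspark-sketch").getOrCreate()
    df = spark.createDataFrame(
        [("u1", 10.0), ("u1", None), ("u2", 7.5)], ["user_id", "amount"]
    )
    cleaned = df.dropna(subset=["amount"])                 # cleaning step
    totals = cleaned.groupBy("user_id").agg(F.sum("amount").alias("total"))
    totals.show()
    spark.stop()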

With the growing popularity of machine learning and artificial intelligence, creating a data science program is a key initiative at most companies today. However, it’s not always clear to executives how they can deliver a return on investments in data science. To explain this, we invited an expert who has spent most of his career in the data science trenches and has a clear-minded perspective on how to deliver ROI with data science.

Alan Jacobson is the chief data and analytics officer (CDAO) of Alteryx, driving key data initiatives and accelerating digital business transformation for the Alteryx global customer base. As CDAO, Jacobson leads the company's data science practice as a best-in-class example of how a company can get maximum leverage out of its data and the insights it contains, and is responsible for data management and governance, product and internal data, and use of the Alteryx Platform to drive continued growth.

Prior to joining Alteryx, Alan held a variety of leadership roles at Ford Motor Company across engineering, marketing, sales and new business development; most recently leading a team of data scientists to drive digital transformation across the enterprise. As an Alteryx evangelist at Ford, Alan spent many years leveraging the Alteryx Platform across the company and witnessed first-hand the impact a culture of analytics can have on the bottom line and what it takes to succeed as a data-driven enterprise.

How do you organize a data analytics program to maximize value for the organization? Although there is no right or wrong way to do this, several patterns emerge when you examine successful organizations.

Originally published at https://www.eckerson.com/articles/organizing-for-success-part-ii-how-to-organize-a-data-analytics-program