talk-data.com

Topic

Python

programming_language data_science web_development

1446 tagged

Activity Trend

185 peak/qtr · 2020-Q1 to 2026-Q1

Activities

1446 activities · Newest first

Summary As data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating on software and analysis, but collaborating on data is still an underserved capability. Gavin Mendel-Gleason encountered this problem first hand while working on the Seshat databank, leading him to create TerminusDB and TerminusHub. In this episode he explains how the TerminusDB system is architected to provide a versioned graph storage engine that allows for branching and merging of data sets, and how that opens up new possibilities for individuals and teams to work together on building new data repositories. This is a fascinating conversation on the technical challenges involved, the opportunities that such a system provides, and the complexities inherent to building a successful business on open source.
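
As a taste of what that git-like workflow looks like in practice, here is a minimal sketch using the terminusdb-client Python package. The endpoint, credentials, database, and document schema are hypothetical, and the method names (connect, create_branch, checkout, insert_document) are recalled from the client's documented surface and may differ between client versions, so treat this as illustrative rather than definitive.

    # Hedged sketch: branch-and-merge style collaboration on data with the
    # terminusdb-client package. All names below are hypothetical; exact
    # method signatures vary across client versions.
    from terminusdb_client import WOQLClient

    client = WOQLClient("http://localhost:6363")
    client.connect(user="admin", key="root", db="seshat_polities")

    # Create and switch to a branch, much like `git branch` + `git checkout`.
    client.create_branch("cleanup", empty=False)
    client.checkout("cleanup")

    # Commits on the branch leave `main` untouched until they are merged
    # (the client exposes rebase/merge-style operations for that step).
    client.insert_document(
        {"@type": "Polity", "name": "Roman Empire", "start_year": -27},
        commit_msg="Add Roman Empire record on the cleanup branch",
    )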

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took, compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.

You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!

Your host is Tobias Macey and today I’m interviewing Gavin Mendel-Gleason about TerminusDB, an open source, model-driven graph database for knowledge graph representation.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what TerminusDB is and what motivated you to build it?
What are the use cases that TerminusDB and TerminusHub are designed for?
There are a number of different reasons and methods for versioning data, such as th

Summary As more organizations are gaining experience with data management and incorporating analytics into their decision making, their next move is to adopt machine learning. In order to make those efforts sustainable, the core capability they need is for data scientists and analysts to be able to build and deploy features in a self-service manner. As a result the feature store is becoming a required piece of the data platform. To fill that need Kevin Stumpf and the team at Tecton are building an enterprise feature store as a service. In this episode he explains how his experience building the Michelangelo platform at Uber has informed the design and architecture of Tecton, how it integrates with your existing data systems, and the elements that are required for a well-engineered feature store.
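
Tecton's own SDK is not reproduced here, but the core contract a feature store implements can be shown in a few lines: keep feature values per entity over time, and serve the value that was known as of a requested timestamp, which is what makes training data point-in-time correct. The class below is a purely illustrative toy, not Tecton's API.

    # Toy feature store illustrating point-in-time lookups (not Tecton's API).
    from bisect import bisect_right
    from collections import defaultdict

    class ToyFeatureStore:
        def __init__(self):
            # feature name -> entity id -> chronologically sorted (ts, value)
            self._data = defaultdict(lambda: defaultdict(list))

        def write(self, feature, entity_id, ts, value):
            series = self._data[feature][entity_id]
            series.append((ts, value))
            series.sort()

        def read_as_of(self, feature, entity_id, ts):
            # Return the latest value written at or before `ts`.
            series = self._data[feature][entity_id]
            idx = bisect_right(series, (ts, float("inf")))
            return series[idx - 1][1] if idx else None

    store = ToyFeatureStore()
    store.write("avg_trip_distance", "user_42", ts=100, value=3.4)
    store.write("avg_trip_distance", "user_42", ts=200, value=4.1)
    print(store.read_as_of("avg_trip_distance", "user_42", ts=150))  # 3.4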

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took, compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.

You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!

Your host is Tobias Macey and today I’m interviewing Kevin Stumpf about Tecton and the role that the feature store plays in a modern MLOps platform.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Tecton and your motivation for starting the business?
For anyone who isn’t familiar with the concept, what is an example of a feature?
How do you define what a feature store is?
What role does a feature store play in the overall lifecycle of a machine learning p

Summary One of the core responsibilities of data engineers is to manage the security of the information that they process. The team at Satori has a background in cybersecurity and they are using the lessons that they learned in that field to address the challenge of access control and auditing for data governance. In this episode co-founder and CTO Yoav Cohen explains how the Satori platform provides a proxy layer for your data, the challenges of managing security across disparate storage systems, and their approach to building a dynamic data catalog based on the records that your organization is actually using. This is an interesting conversation about the intersection of data and security and the lessons that can be learned in each direction.
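
Because Satori sits in front of the data store as a proxy, adopting it does not change the client stack: the same driver and SQL keep working, and only the hostname points at the access layer instead of the warehouse. A hedged sketch with psycopg2 follows; the endpoint names and credentials are hypothetical.

    # Illustration of the proxy pattern: route an ordinary PostgreSQL
    # connection through a data-access proxy. Endpoint names are made up.
    import psycopg2

    conn = psycopg2.connect(
        host="analytics.example-org.satori-proxy.net",  # proxy, not the DB
        dbname="analytics",
        user="data_analyst",
        password="...",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT email FROM customers LIMIT 5")
        # Depending on policy, sensitive columns may come back masked,
        # and the query itself is classified and audited in flight.
        print(cur.fetchall())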

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Your host is Tobias Macey and today I’m interviewing Yoav Cohen about Satori, a data access service to monitor, classify, and control access to sensitive data.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what you have built at Satori?

What is the story behind the product and company?

How does Satori compare to other tools and products for managing access control and governance for data assets?
What are the biggest challenges that organizations face in establishing and enforcing policies for their data?
What are the main goals for the Satori product and what use cases does it enable?
Can you describe how the Satori platform is architected?

How has the design of the platform evolved since you first began working on it?

How have your experiences working in cybersecurity informed your approach to data governance?
How does the design of the Satori platform simplify technical aspects of data governance?

What aspects of governance do you delegate to other systems or platforms?

What elements of data infrastructure does Satori integrate with?

For someone who is adopting Satori, what is involved in getting it deployed and set up with their existing data platforms?

What do you see as being the most complex or underserved aspects of data governance?

How much of that complexity is inherent to the problem vs. being a result of how the industry has evolved?

What are some of the most interesting, innovative, or unexpected ways that you have seen the Satori platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Satori?
When is Satori the wrong choice?
What do you have planned for the future of the platform?

Contact Info

LinkedIn
@yoavcohen on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Satori
Data Governance
Data Masking
TLS (Transport Layer Security)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Data governance is a term that encompasses a wide range of responsibilities, both technical and process oriented. One of the more complex aspects is that of access control to the data assets that an organization is responsible for managing. The team at Immuta has built a platform that aims to tackle that problem in a flexible and maintainable fashion so that data teams can easily integrate authorization, data masking, and privacy enhancing technologies into their data infrastructure. In this episode Steve Touw and Stephen Bailey share what they have built at Immuta, how it is implemented, and how it streamlines the workflow for everyone involved in working with sensitive data. If you are starting down the path of implementing a data governance strategy then this episode will provide a great overview of what is involved.
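
To make the access-control idea concrete, here is a small sketch of policy-based dynamic masking, the same concept Immuta applies at query time. This is illustrative Python over a DataFrame, not Immuta's API; the roles and columns are invented.

    # Toy policy engine: the same data yields raw or masked values
    # depending on the requester's role. Not Immuta's API.
    import hashlib
    import pandas as pd

    POLICIES = {
        "email": {"privacy_officer"},  # roles allowed to see raw values
        "ssn": set(),                  # no role sees raw SSNs
    }

    def _mask(value):
        return hashlib.sha256(str(value).encode()).hexdigest()[:12]

    def apply_policies(df, role):
        out = df.copy()
        for column, allowed in POLICIES.items():
            if column in out.columns and role not in allowed:
                out[column] = out[column].map(_mask)
        return out

    people = pd.DataFrame({"email": ["a@example.com"], "ssn": ["123-45-6789"]})
    print(apply_policies(people, role="analyst"))          # both masked
    print(apply_policies(people, role="privacy_officer"))  # email in the clear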

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Feature flagging is a simple concept that enables you to ship faster, test in production, and do easy rollbacks without redeploying code. Teams using feature flags release new software with less risk, and release more often. ConfigCat is a feature flag service that lets you easily add flags to your Python code, and 9 other platforms. By adopting ConfigCat you and your manager can track and toggle your feature flags from their visual dashboard without redeploying any code or configuration, including granular targeting rules. You can roll out new features to a subset of your users for beta testing or canary deployments. With their simple API, clear documentation, and pricing that is independent of your team size you can get your first feature flags added in minutes without breaking the bank. Go to dataengineeringpodcast.com/configcat today to get 35% off any paid plan with code DATAENGINEERING or try out their free forever plan.

You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data inf

Presenting: SQLFluff

The dbt project at tails.com has over 600 models and 66k lines of code. With multiple contributors to a project and varying SQL backgrounds, it's really difficult to maintain consistent readability and comprehension across a codebase like that by hand.

Python has flake8, Javascript has JSLint, but SQL...?

Listen to this talk from Alan Cruickshank to find out whether SQLFluff might help your teams be more productive with SQL.
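
For a quick taste before the talk: SQLFluff exposes both a CLI (sqlfluff lint, sqlfluff fix) and a small "simple" Python API. The snippet below uses that API; exact rule codes and the shape of the returned records vary between versions, so check the docs for your release.

    # Lint and auto-fix a SQL string with SQLFluff's simple Python API.
    # CLI equivalent: `sqlfluff lint models/ --dialect ansi`
    import sqlfluff

    bad_sql = "SELECT id,name FROM users"

    for violation in sqlfluff.lint(bad_sql, dialect="ansi"):
        print(violation)  # e.g. missing whitespace after the comma

    print(sqlfluff.fix(bad_sql, dialect="ansi"))  # roughly: SELECT id, name FROM users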

In this episode, Bryce and Conor talk about each of their favorite data structures.

Date Recorded: 2020-11-28
Date Released: 2020-12-04

Links
C++ | Containers
OCaml | Containers
Java | Collections
Python | Collections
Kotlin | Collections
Scala | Collections
Rust | Collections
Go | Collections
Haskell | Collections
TS | Collections
Ruby | Collections
JS | Collections
F# | Collection Types
Racket | Data Structures
Clojure | Data Structures
What do you mean by “cache friendly”? - Björn Fahller - code::dive 2019
Alan J. Perlis’ Epigrams on Programming
std::vector
P1072 basic_string::resize_default_init
std::array
std::unique_ptr (Array Specialization)
P0316 allocate_unique and allocator_delete
thrust::allocate_unique

Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
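
Since this topic page is Python-centric, the "Python | Collections" link above deserves a quick illustration: the standard library's collections module is where several of the episode's favorite container types live.

    # A quick tour of stdlib containers behind the "Python | Collections" link.
    from collections import Counter, defaultdict, deque

    d = deque([1, 2, 3])
    d.appendleft(0)               # O(1) at both ends, unlike list.insert(0, x)

    counts = Counter("mississippi")
    print(counts.most_common(2))  # [('i', 4), ('s', 4)]

    graph = defaultdict(list)
    graph["a"].append("b")        # missing keys materialize automatically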

Essential Statistics for Non-STEM Data Analysts

Essential Statistics for Non-STEM Data Analysts is your comprehensive guide to mastering the statistical concepts needed for data science. By working through real-world datasets and Python-based examples, you'll learn how to interpret data and build insightful analyses. This book demystifies statistics, making it accessible to anyone aiming to become proficient in data analysis.

What this Book will help me do
Learn how to preprocess, clean, and prepare data for analysis using Python.
Master the foundations of statistical methods such as hypothesis testing and probability theory.
Develop skills to interpret and explain statistical results in the context of data science.
Understand how statistical concepts apply to machine learning tasks like classification and regression.
Build confidence in statistical principles to tackle interviews and enhance your career prospects.

Author(s)
The author is an experienced data scientist and educator with a strong focus on making abstract statistical concepts intuitive and applicable. With a background in designing data science curriculums, the author has a passion for teaching statistics to individuals from diverse and often non-mathematical backgrounds. Through clear explanations and practical examples, the book aims to empower everyone to excel in data analysis and machine learning.

Who is it for?
This book caters specifically to data analysts, data science enthusiasts, and developers eager to enhance their statistical knowledge. It's crafted for readers transitioning into data science who may lack a strong mathematical or statistics background. If you have a basic grasp of Python programming and a keen interest in understanding how to work effectively with data, this book is a perfect fit. Beginners and students aiming to familiarize themselves with statistical foundations for data-oriented careers will greatly benefit from this resource.
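
A one-screen example of the kind of test the book builds up to, using SciPy: compare two small samples and read off a p-value. The numbers are toy data invented for illustration.

    # Two-sample t-test with scipy.stats on toy data.
    from scipy import stats

    control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
    variant = [12.8, 13.1, 12.6, 13.0, 12.9, 12.7]

    t_stat, p_value = stats.ttest_ind(control, variant)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    # A small p-value (conventionally < 0.05) suggests the group difference
    # is unlikely to arise from chance alone.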

Python for Algorithmic Trading

Algorithmic trading, once the exclusive domain of institutional players, is now open to small organizations and individual traders using online platforms. The tool of choice for many traders today is Python and its ecosystem of powerful packages. In this practical book, author Yves Hilpisch shows students, academics, and practitioners how to use Python in the fascinating field of algorithmic trading. You'll learn several ways to apply Python to different aspects of algorithmic trading, such as backtesting trading strategies and interacting with online trading platforms. Some of the biggest buy- and sell-side institutions make heavy use of Python. By exploring options for systematically building and deploying automated algorithmic trading strategies, this book will help you level the playing field.

Set up a proper Python environment for algorithmic trading
Learn how to retrieve financial data from public and proprietary data sources
Explore vectorization for financial analytics with NumPy and pandas
Master vectorized backtesting of different algorithmic trading strategies
Generate market predictions by using machine learning and deep learning
Tackle real-time processing of streaming data with socket programming tools
Implement automated algorithmic trading strategies with the OANDA and FXCM trading platforms
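
The book's signature technique, vectorized backtesting, fits in a dozen lines of pandas. Below is a hedged, self-contained sketch on synthetic prices (a simple moving-average crossover; not the book's exact code and not investment advice):

    # Vectorized backtest of an SMA crossover on synthetic prices.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))

    fast = prices.rolling(10).mean()
    slow = prices.rolling(50).mean()
    position = pd.Series(np.where(fast > slow, 1, -1), index=prices.index)

    log_returns = np.log(prices / prices.shift(1))
    strategy = position.shift(1) * log_returns  # trade on yesterday's signal

    print("buy & hold:", round(log_returns.sum(), 4))
    print("crossover: ", round(strategy.sum(), 4))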

Machine Learning and Data Science Blueprints for Finance

Over the next few decades, machine learning and data science will transform the finance industry. With this practical book, analysts, traders, researchers, and developers will learn how to build machine learning algorithms crucial to the industry. You'll examine ML concepts and over 20 case studies in supervised, unsupervised, and reinforcement learning, along with natural language processing (NLP). Ideal for professionals working at hedge funds, investment and retail banks, and fintech firms, this book also delves deep into portfolio management, algorithmic trading, derivative pricing, fraud detection, asset price prediction, sentiment analysis, and chatbot development. You'll explore real-life problems faced by practitioners and learn scientifically sound solutions supported by code and examples.

This book covers:
Supervised learning regression-based models for trading strategies, derivative pricing, and portfolio management
Supervised learning classification-based models for credit default risk prediction, fraud detection, and trading strategies
Dimensionality reduction techniques with case studies in portfolio management, trading strategy, and yield curve construction
Algorithms and clustering techniques for finding similar objects, with case studies in trading strategies and portfolio management
Reinforcement learning models and techniques used for building trading strategies, derivatives hedging, and portfolio management
NLP techniques using Python libraries such as NLTK and scikit-learn for transforming text into meaningful representations
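
As a miniature of one case-study pattern from the book (classification for fraud detection on imbalanced data), here is a self-contained scikit-learn sketch on synthetic data:

    # Fraud-style classification on an imbalanced synthetic dataset.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, weights=[0.97], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(X_tr, y_tr)
    print(classification_report(y_te, model.predict(X_te)))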

Practical Python Data Visualization: A Fast Track Approach To Learning Data Visualization With Python

Quickly start programming with Python 3 for data visualization with this step-by-step, detailed guide. This book’s programming-friendly approach using libraries such as Leather, NumPy, Matplotlib, and pandas will serve as a template for business and scientific visualizations. You’ll begin by installing Python 3, see how to work in Jupyter notebooks, and explore Leather, Python’s popular data visualization charting library. You’ll also be introduced to the scientific Python 3 ecosystem and work with the basics of NumPy, an integral part of that ecosystem. Later chapters are focused on various NumPy routines along with getting started with scientific data visualization using Matplotlib. You’ll review the visualization of 3D data using graphs and networks and finish up by looking at data visualization with pandas, including the visualization of COVID-19 data sets. The code examples are tested on popular platforms like Ubuntu, Windows, and Raspberry Pi OS. With Practical Python Data Visualization you’ll master the core concepts of data visualization with pandas and the Jupyter notebook interface.

What You'll Learn
Review practical aspects of Python data visualization with programming-friendly abstractions
Install Python 3 and Jupyter on multiple platforms including Windows, Raspberry Pi, and Ubuntu
Visualize COVID-19 data sets with pandas

Who This Book Is For
Data science enthusiasts and professionals, business analysts and managers, software engineers, and data engineers.
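
In the spirit of the book's pandas chapters, here is a minimal example of the workflow it teaches: build a small time-indexed DataFrame and chart it with Matplotlib. The numbers are synthetic stand-ins, not real COVID-19 data.

    # Plot a toy weekly time series with pandas + Matplotlib.
    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.DataFrame(
        {"date": pd.date_range("2020-03-01", periods=6, freq="W"),
         "cases": [10, 40, 160, 390, 720, 980]}
    ).set_index("date")

    df["cases"].plot(marker="o", title="Weekly reported cases (toy data)")
    plt.ylabel("cases")
    plt.tight_layout()
    plt.show()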

Data Engineering with Python

Discover the inner workings of data pipelines with 'Data Engineering with Python', a practical guide to mastering the art of data engineering. Through hands-on examples, you'll explore the process of designing data models, implementing data pipelines, and automating data flows, all within the context of Python.

What this Book will help me do
Understand the fundamentals of designing data architectures and capturing data requirements.
Extract, clean, and transform data from various sources, refining it for precise applications.
Implement end-to-end data pipelines, including staging, validation, and production deployment.
Leverage Python to connect with databases, perform data manipulations, and build analytics workflows.
Monitor and log data pipelines to ensure smooth, real-time operations and high quality.

Author(s)
Paul Crickard is a seasoned expert in data engineering and analytics, bringing years of practical experience to this technical guide. His unique ability to make complex technical concepts accessible makes this book invaluable for learners and professionals alike. A lifelong technologist, Paul focuses on actionable skills and building confidence to work with data pipelines and models.

Who is it for?
This book is ideal for aspiring data engineers, data analysts aiming to elevate their technical skillsets, or IT professionals transitioning into data-driven roles. Whether you're just stepping into the field or enhancing your Python-based data capabilities, this book is tailored to provide solid grounding and practical expertise. Beginners in data engineering will find it accessible and easy to get started, while those refreshing their knowledge will benefit from its focused projects.
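
The staging, validation, and load steps the book describes can be sketched end to end with nothing but pandas and the standard library; the file and table names below are hypothetical:

    # Minimal extract -> validate -> load pipeline (illustrative only).
    import sqlite3
    import pandas as pd

    raw = pd.read_csv("staging/users.csv")                  # extract

    clean = (raw.dropna(subset=["user_id"])                 # transform
                .drop_duplicates(subset=["user_id"]))
    assert clean["user_id"].is_unique, "validation failed"  # validate

    with sqlite3.connect("warehouse.db") as conn:           # load
        clean.to_sql("users", conn, if_exists="replace", index=False)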

Artificial Intelligence in Finance

The widespread adoption of AI and machine learning is revolutionizing many industries today. Once these technologies are combined with the programmatic availability of historical and real-time financial data, the financial industry will also change fundamentally. With this practical book, you'll learn how to use AI and machine learning to discover statistical inefficiencies in financial markets and exploit them through algorithmic trading. Author Yves Hilpisch shows practitioners, students, and academics in both finance and data science practical ways to apply machine learning and deep learning algorithms to finance. Thanks to lots of self-contained Python examples, you'll be able to replicate all results and figures presented in the book.

In five parts, this guide helps you:
Learn central notions and algorithms from AI, including recent breakthroughs on the way to artificial general intelligence (AGI) and superintelligence (SI)
Understand why data-driven finance, AI, and machine learning will have a lasting impact on financial theory and practice
Apply neural networks and reinforcement learning to discover statistical inefficiencies in financial markets
Identify and exploit economic inefficiencies through backtesting and algorithmic trading, the automated execution of trading strategies
Understand how AI will influence the competitive dynamics in the financial industry and what the potential emergence of a financial singularity might bring about
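
"Discovering statistical inefficiencies" has a simple numerical core: test whether past returns carry information about future ones. A toy check on synthetic (and therefore efficient) data:

    # Does yesterday's return predict today's? On white noise it should not.
    import numpy as np

    rng = np.random.default_rng(1)
    returns = rng.normal(0, 0.01, 1000)

    x, y = returns[:-1], returns[1:]        # lagged vs next-day returns
    slope = np.cov(x, y)[0, 1] / np.var(x)  # OLS slope through the origin-centered data
    print(f"lag-1 slope: {slope:+.4f}")     # near zero => no exploitable signal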

Advanced Analytics in Power BI with R and Python: Ingesting, Transforming, Visualizing

This easy-to-follow guide provides R and Python recipes to help you learn and apply the top languages in the field of data analytics to your work in Microsoft Power BI. Data analytics expert and author Ryan Wade shows you how to use R and Python to perform tasks that are extremely hard, if not impossible, to do using native Power BI tools. For example, you will learn to score Power BI data using custom data science models and powerful models from Microsoft Cognitive Services. The R and Python languages are powerful complements to Power BI. They enable advanced data transformation techniques that are difficult to perform in Power BI in its default configuration but become easier by leveraging the capabilities of R and Python. If you are a business analyst, data analyst, or a data scientist who wants to push Power BI and transform it from being just a business intelligence tool into an advanced data analytics tool, then this is the book to help you do that.

What You Will Learn
Create advanced data visualizations via R using the ggplot2 package
Ingest data using R and Python to overcome some limitations of Power Query
Apply machine learning models to your data using R and Python without the need of Power BI premium capacity
Incorporate advanced AI in Power BI without the need of Power BI premium capacity via Microsoft Cognitive Services, IBM Watson Natural Language Understanding, and pre-trained models in SQL Server Machine Learning Services
Perform advanced string manipulations not otherwise possible in Power BI using R and Python

Who This Book Is For
Power users, data analysts, and data scientists who want to go beyond Power BI’s built-in functionality to create advanced visualizations, transform data in ways not otherwise supported, and automate data ingestion from sources such as SQL Server and Excel in a more concise way
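
For orientation, this is roughly what a Python script visual inside Power BI looks like: Power BI injects the fields you select as a pandas DataFrame named dataset and renders whatever Matplotlib draws. The column names here are hypothetical.

    # Body of a Power BI Python visual (runs inside Power BI, which
    # provides the `dataset` DataFrame; the columns below are made up).
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.boxplot(data=dataset, x="region", y="sales")
    plt.title("Sales distribution by region")
    plt.show()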

Learn Data Science Using SAS Studio: A Quick-Start Guide

Do you want to create data analysis reports without writing a line of code? This book introduces SAS Studio, a free data science web browser-based product for educational and non-commercial purposes. The power of SAS Studio comes from its visual point-and-click user interface that generates SAS code. It is easier to learn SAS Studio than to learn R and Python to accomplish data cleaning, statistics, and visualization tasks. The book includes a case study about analyzing the data required for predicting the results of presidential elections in the state of Maine for 2016 and 2020. In addition to the presidential elections, the book provides real-life examples including analyzing stocks, oil and gold prices, crime, marketing, and healthcare. You will see data science in action and how easy it is to perform complicated tasks and visualizations in SAS Studio. You will learn, step-by-step, how to do visualizations, including maps. In most cases, you will not need a line of code as you work with the SAS Studio graphical user interface. The book includes explanations of the code that SAS Studio generates automatically. You will learn how to edit this code to perform more complicated advanced tasks. The book introduces you to multiple SAS products such as SAS Viya, SAS Analytics, and SAS Visual Statistics.

What You Will Learn
Become familiar with the SAS Studio IDE
Understand essential visualizations
Know the fundamental statistical analysis required in most data science and analytics reports
Clean the most common data set problems
Use linear regression for data prediction
Write programs in SAS
Get introduced to SAS Viya, which is more potent than SAS Studio

Who This Book Is For
A general audience of people who are new to data science, students, and data analysts and scientists who are experienced but new to SAS. No programming or in-depth statistics knowledge is needed.

Summary In-memory computing provides significant performance benefits, but brings along challenges for managing failures and scaling up. Hazelcast is a platform for managing stateful in-memory storage and computation across a distributed cluster of commodity hardware. On top of this foundation, the Hazelcast team has also built a streaming platform for reliable, high-throughput data transmission. In this episode Dale Kim shares how Hazelcast is implemented, the use cases that it enables, and how it complements on-disk data management systems.
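
For a sense of the developer experience, here is a hedged sketch using the hazelcast-python-client package: a distributed map lives in the cluster's memory rather than in your process. The cluster address and map name are hypothetical, and API details may differ between client versions.

    # Talk to a Hazelcast cluster from Python (illustrative sketch).
    import hazelcast

    client = hazelcast.HazelcastClient(cluster_members=["10.0.0.5:5701"])
    scores = client.get_map("player-scores").blocking()

    scores.put("alice", 42)      # replicated across the cluster's memory
    print(scores.get("alice"))   # 42, possibly served by another node

    client.shutdown()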

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Tree Schema is a data catalog that is making metadata management accessible to everyone. With Tree Schema you can create your data catalog and have it fully populated in under five minutes when using one of the many automated adapters that can connect directly to your data stores. Tree Schema includes essential cataloging features such as first class support for both tabular and unstructured data, data lineage, rich text documentation, asset tagging and more. Built from the ground up with a focus on the intersection of people and data, your entire team will find it easier to foster collaboration around your data. With the most transparent pricing in the industry – $99/mo for your entire company – and a money-back guarantee for excellent service, you’ll love Tree Schema as much as you love your data. Go to dataengineeringpodcast.com/treeschema today to get your first month free, and mention this podcast to get 50% off your first three months after the trial.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!

Your host is Tobias Macey and today I’m interviewing Dale Kim about Hazelcast, a distributed in-memory computing platform for data intensive applications.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what Hazelcast is and its origins?
What are the benefits and tradeoffs of in-memory computation for data-intensive workloads?
What are some of the common use cases for the Hazelcast in-memory grid?
How is Hazelcast implemented?

How has the architecture evolved since it was first created?

How is the Jet streaming framework architected?

What was the motivation for building it?
How do the capabilities of Jet compare to systems such as Flink or Spark Streaming?

How has the introduction of hardware capabilities such as NVMe drives influenced the market for in-memory systems?
How is the governance of the open source grid and Jet projects handled?

What is the guiding heuristic for which capabilities or features to include in the open source projects vs. the commercial offerings?

What is involved in building an application or workflow on top of Hazelcast?
What are the common patterns for engineers who are building on top of Hazelcast?
What is involved in deploying and maintaining an installation of the Hazelcast grid or Jet streaming?
What are the scaling factors for Hazelcast?

What are the edge cases that users should be aware of?

What are some of the most interesting, innovative, or unexpected ways that you have seen Hazelcast used?
When is Hazelcast Grid or Jet the wrong choice?
What is in store for the future of Hazelcast?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Hazelcast
Istanbul
Apache Spark
OrientDB
CAP Theorem
NVMe
Memristors
Intel Optane Persistent Memory
Hazelcast Jet
Kappa Architecture
IBM Cloud Paks
Digital Integration Hub (Gartner)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Self-Service AI with Power BI Desktop: Machine Learning Insights for Business

This book explains how you can enrich the data you have loaded into Power BI Desktop by accessing a suite of Artificial Intelligence (AI) features. These AI features are built into Power BI Desktop and help you to gain new insights from existing data. Some of the features are automated and are available to you at the click of a button or through writing Data Analysis Expressions (DAX). Other features are available through writing code in either the R, Python, or M languages. This book opens up the entire suite of AI features to you with clear examples showing when they are best applied and how to invoke them on your own datasets. No matter if you are a business user, analyst, or data scientist – Power BI has AI capabilities tailored to you. This book helps you learn what types of insights Power BI is capable of delivering automatically. You will learn how to integrate and leverage the use of the R and Python languages for statistics, how to integrate with Cognitive Services and Azure Machine Learning Services when loading data, how to explore your data by asking questions in plain English ... and more! There are AI features for discovering your data, characterizing unexplored datasets, and building what-if scenarios. There’s much to like and learn from this book whether you are a newcomer to Power BI or a seasoned user. Power BI Desktop is a freely available tool for visualization and analysis. This book helps you to get the most from that tool by exploiting some of its latest and most advanced features.

What You Will Learn
Ask questions in natural language and get answers from your data
Let Power BI explain why a certain data point differs from the rest
Have Power BI show key influencers over categories of data
Access artificial intelligence features available in the Azure cloud
Walk the same drill-down path in different parts of your hierarchy
Load visualizations to add smartness to your reports
Simulate changes in data and immediately see the consequences
Know your data, even before you build your first report
Create new columns by giving examples of the data that you need
Transform and visualize your data with the help of R and Python scripts

Who This Book Is For
The enthusiastic Power BI user who wants to apply state-of-the-art artificial intelligence (AI) features to gain new insights from existing data; end-users and IT professionals who are not shy of jumping into a new world of machine learning and are ready to take a deeper look into their data; and those wanting to step up their game from simple reporting and visualizations by moving into diagnostic and predictive analysis.

Learn MongoDB 4.x

Explore the capabilities of MongoDB 4.x with this comprehensive guide designed for developers and administrators working with NoSQL databases. Dive into topics such as database design, advanced query handling, and security configuration, and gain hands-on experience through practical examples and insights.

What this Book will help me do
Learn to configure and install MongoDB 4.x for development and administration.
Understand the principles of NoSQL schema design for optimal performance.
Perform complex queries and operations to manage your MongoDB databases.
Secure your MongoDB setup with role-based access control and encryption techniques.
Monitor and optimize database performance for production environments.

Author(s)
The author of 'Learn MongoDB 4.x' is a seasoned database expert with extensive experience in NoSQL technologies. With a focus on practicality and clear explanations, the book brings deep insights into MongoDB's development and administration.

Who is it for?
This book is ideal for early-career developers, system administrators, and database enthusiasts eager to break into NoSQL technologies. If you are familiar with Python and basic database concepts, this book will guide you through mastering MongoDB. It's perfect for those building dynamic backend systems.
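
A minimal PyMongo session covering the insert-and-query loop the book drills into (the connection string, database, and collection names are hypothetical):

    # Insert and query documents with PyMongo.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["bookstore"]

    db.books.insert_one({"title": "Learn MongoDB 4.x", "tags": ["nosql"]})
    for doc in db.books.find({"tags": "nosql"}).limit(5):
        print(doc["title"])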

The Data Science Workshop - Second Edition

The Data Science Workshop provides a comprehensive introduction to building real-world data science projects. Through a hands-on approach, you will learn how to analyze data, build machine learning models, and deploy them effectively in various scenarios. This book is designed to equip you with the skills to confidently tackle data science challenges.

What this Book will help me do
Understand the differences between supervised and unsupervised learning to select the appropriate technique.
Master data manipulation and analysis using popular Python libraries like pandas and scikit-learn.
Develop skills in regression, classification, and clustering to solve diverse data science problems.
Learn advanced methods to improve model accuracy, including hyperparameter tuning and feature engineering.
Implement and deploy machine learning models efficiently in production workflows.

Author(s)
The authors of The Data Science Workshop are experienced professionals and educators in the field of data science and machine learning. They have extensive expertise in using practical methods to solve data challenges and have a passion for teaching others through engaging and clear instructional material.

Who is it for?
This book is ideal for aspiring data analysts, data scientists, and business analysts who wish to build foundational skills in data science. It caters to those new to the field and professionals transitioning to a data-centric role, providing practical knowledge without requiring an advanced mathematical background. Familiarity with Python is recommended.
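
One of the book's techniques in miniature: hyperparameter tuning with scikit-learn's GridSearchCV on a dataset bundled with the library.

    # Cross-validated hyperparameter search with scikit-learn.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_breast_cancer(return_X_y=True)
    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, round(grid.best_score_, 3))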

Hands-on Time Series Analysis with Python: From Basics to Bleeding Edge Techniques

Learn the concepts of time series from traditional to bleeding-edge techniques. This book uses comprehensive examples to clearly illustrate statistical approaches and methods of analyzing time series data and its utilization in the real world. All the code is available in Jupyter notebooks. You'll begin by reviewing time series fundamentals, the structure of time series data, pre-processing, and how to craft features through data wrangling. Next, you'll look at traditional time series techniques like ARMA, SARIMAX, VAR, and VARMA using frameworks like statsmodels and pmdarima. The book also explains building classification models using sktime, and covers advanced deep learning-based techniques like ANN, CNN, RNN, LSTM, GRU, and Autoencoders to solve time series problems using TensorFlow. It concludes by explaining the popular framework fbprophet for modeling time series analysis. After reading Hands-On Time Series Analysis with Python, you'll be able to apply these new techniques in industries such as oil and gas, robotics, manufacturing, government, banking, retail, healthcare, and more.

What You'll Learn
Basic to advanced concepts of time series
How to design, develop, train, and validate time-series methodologies
What the smoothing, ARMA, ARIMA, SARIMA, SARIMAX, VAR, and VARMA techniques are, and how to optimally tune parameters to yield the best results
How to leverage bleeding-edge techniques such as ANN, CNN, RNN, LSTM, GRU, and Autoencoders to solve both univariate and multivariate problems, using two types of data preparation methods for time series
Univariate and multivariate problem solving using fbprophet

Who This Book Is For
Data scientists, data analysts, financial analysts, and stock market researchers
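
A taste of the classical models covered early in the book: fit an ARIMA model with statsmodels on a synthetic AR(1) series and forecast a few steps ahead.

    # Fit ARIMA(1,0,0) to synthetic AR(1) data and forecast.
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    y = [0.0]
    for _ in range(199):
        y.append(0.7 * y[-1] + rng.normal())

    fit = ARIMA(y, order=(1, 0, 0)).fit()
    print(fit.params)             # the AR(1) coefficient should land near 0.7
    print(fit.forecast(steps=3))  # three-step-ahead forecast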

Inventory Optimization

"In this book . . . Nicolas Vandeput hacks his way through the maze of quantitative supply chain optimizations. This book illustrates how the quantitative optimization of 21st century supply chains should be crafted and executed. . . . Vandeput is at the forefront of a new and better way of doing supply chains, and thanks to a richly illustrated book, where every single situation gets its own illustrating code snippet, so could you." --Joannes Vermorel, CEO, Lokad

Inventory Optimization argues that mathematical inventory models can only take us so far with supply chain management. In order to optimize inventory policies, we have to use probabilistic simulations. The book explains how to implement these models and simulations step-by-step, starting from simple deterministic ones to complex multi-echelon optimization. The first two parts of the book discuss classical mathematical models, their limitations and assumptions, and a quick but effective introduction to Python is provided. Part 3 contains more advanced models that will allow you to optimize your profits, estimate your lost sales, and use advanced demand distributions. It also provides an explanation of how you can optimize a multi-echelon supply chain based on a simple yet powerful framework. Part 4 discusses inventory optimization thanks to simulations under custom discrete demand probability functions. Inventory managers, demand planners, and academics interested in gaining cost-effective solutions will benefit from the "do-it-yourself" examples and Python programs included in each chapter.

Events around the book
Link to a De Gruyter Online Event in which the author Nicolas Vandeput, together with Stefan de Kok, supply chain innovator and CEO of Wahupa; Koen Cobbaert, Director in the S&O Industry practice of PwC Belgium; Bram Desmet, professor of operations & supply chain at the Vlerick Business School in Ghent; and Karl-Eric Devaux, Planning Consultant, Hatmill, discuss models for inventory optimization. The event will be moderated by Eric Wilson, Director of Thought Leadership for the Institute of Business Forecasting (IBF): https://youtu.be/565fDQMJEEg
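
The book's central move, judging an inventory policy by probabilistic simulation rather than a closed-form formula, can be sketched in a few lines of NumPy. All numbers below are toy values: an order-up-to policy with daily review and Poisson demand.

    # Monte Carlo estimate of the fill rate of an order-up-to policy.
    import numpy as np

    rng = np.random.default_rng(7)
    ORDER_UP_TO, DAYS, RUNS = 60, 365, 1000

    fill_rates = []
    for _ in range(RUNS):
        served = demanded = 0
        for _ in range(DAYS):
            on_hand = ORDER_UP_TO          # replenished to target each day
            demand = rng.poisson(50)
            served += min(demand, on_hand)
            demanded += demand
        fill_rates.append(served / demanded)

    print(f"estimated fill rate: {np.mean(fill_rates):.3f}")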