talk-data.com

Topic: Python
Tags: programming_language, data_science, web_development
1446 tagged activities

Activity Trend: 185 peak/qtr (2020-Q1 to 2026-Q1)

Activities (1446 · Newest first)

Summary Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components onto a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allow his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service, this is definitely worth listening to for some perspective.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.

Principles of Strategic Data Science

"Principles of Strategic Data Science" is your go-to guide for creating measurable value from data through strategic use of tools and techniques. This book takes you through the key theoretical foundations, practical tools, and managerial perspective necessary to succeed in data science.

What this Book will help me do
Master the five-phase framework for strategic data science.
Learn to visualize data and information effectively.
Explore the role and contributions of a data science manager.
Gain clear insights into the organizational benefits of data science.
Understand the ethical and mathematical boundaries of data analysis.

Author(s)
Peter Prevos is an accomplished engineer and social scientist with extensive expertise in data science applications. He combines technical insights with social science management practices to design effective data strategies. Known for his clear teaching style, Peter helps professionals integrate theory with practical planning.

Who is it for?
This book is ideal for data scientists and analysts seeking to deepen their strategic understanding of data science. It's well suited for intermediate professionals looking to gain insights into data-driven decision making. Readers should have basic programming knowledge in Python or R. Novice managers eager to harness data for organizational goals will also find it valuable.

Geospatial Data Science Quick Start Guide

"Geospatial Data Science Quick Start Guide" provides a practical and effective introduction to leveraging geospatial data in data science. In this book, you will learn techniques for analyzing location-based data, building intelligent models, and performing geospatial operations for various applications.

What this Book will help me do
Understand the principles and techniques for analyzing geospatial data.
Set up Python tools to work effectively with location intelligence.
Perform advanced spatial operations such as geocoding and proximity analysis.
Develop systems such as geofencing and location-based recommendation engines.
Obtain actionable insights by visualizing and processing spatial data effectively.

Author(s)
Abdishakur Hassan and Jayakrishnan Vijayaraghavan are experts in geospatial analysis. With extensive experience in applying data science to location intelligence, they bring a practical and hands-on approach to coding, teaching, and problem-solving. They are passionate about sharing their knowledge through clear explanations and structured learning paths.

Who is it for?
This book is ideal for data scientists interested in integrating geospatial analysis into their models and workflows. It is also suitable for GIS developers looking to enhance existing systems with advanced data analysis capabilities. Readers should have experience with Python and a basic understanding of data science concepts. If location-based data intrigues you, this book is your guide.

Machine Learning for Finance

Dive deep into how machine learning is transforming the financial industry with "Machine Learning for Finance". This comprehensive guide explores cutting-edge concepts in machine learning while providing practical insights and Python code examples to help readers apply these techniques to real-world financial scenarios. Whether tackling fraud detection, financial forecasting, or sentiment analysis, this book equips you with the understanding and tools needed to excel.

What this Book will help me do
Understand and implement machine learning techniques for structured data, natural language, images, and text.
Learn Python-based tools and libraries such as scikit-learn, Keras, and TensorFlow for financial data analysis.
Apply machine learning to tasks like predicting financial trends, detecting fraud, and customer sentiment analysis.
Explore advanced topics such as neural networks, generative adversarial networks (GANs), and reinforcement learning.
Gain hands-on experience with machine learning debugging, product launch preparation, and addressing bias in data.

Author(s)
James Le and Jannes Klaas are experts in machine learning applications in financial technology. Jannes has extensive experience training financial professionals to implement machine learning strategies in their work and pairs this with a deep academic understanding of the topic. Their dedication to empowering readers to confidently integrate AI and machine learning into financial applications shines through in this user-focused, richly detailed book.

Who is it for?
This book is tailored for financial professionals, data scientists, and enthusiasts aiming to harness machine learning's potential in finance. Readers should have a foundational understanding of mathematics, statistics, and Python programming. If you work in financial services and are curious about applications ranging from fraud detection to trend forecasting, this resource is for you. It's designed for those looking to advance their skills and make impactful contributions in financial technology.

Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and The Cloud

This is the eBook of the printed book and may not include any media, website access codes, or print supplements that may come packaged with the bound book. For introductory-level Python programming and/or data-science courses.

A groundbreaking, flexible approach to computer science and data science. The Deitels’ Introduction to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud offers a unique approach to teaching introductory Python programming, appropriate for both computer-science and data-science audiences. Providing the most current coverage of topics and applications, the book is paired with extensive traditional supplements as well as Jupyter Notebooks supplements. Real-world datasets and artificial-intelligence technologies allow students to work on projects making a difference in business, industry, government and academia. Hundreds of examples, exercises, projects (EEPs), and implementation case studies give students an engaging, challenging and entertaining introduction to Python programming and hands-on data science.

Related Content
Video: Python Fundamentals
Live courses: Python Full Throttle with Paul Deitel: A One-Day, Fast-Paced, Code-Intensive Python Presentation; Python® Data Science Full Throttle with Paul Deitel: Introductory Artificial Intelligence (AI), Big Data and Cloud Case Studies

The book’s modular architecture enables instructors to conveniently adapt the text to a wide range of computer-science and data-science courses offered to audiences drawn from many majors. Computer-science instructors can integrate as many or as few data-science and artificial-intelligence topics as they’d like, and data-science instructors can integrate as much or as little Python as they’d like. The book aligns with the latest ACM/IEEE CS-and-related computing curriculum initiatives and with the Data Science Undergraduate Curriculum Proposal sponsored by the National Science Foundation.

Data Science from Scratch, 2nd Edition

To really learn data science, you should not only master the tools—data science libraries, frameworks, modules, and toolkits—but also understand the ideas and principles underlying them. Updated for Python 3.6, this second edition of Data Science from Scratch shows you how these tools and algorithms work by implementing them from scratch. If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with the hacking skills you need to get started as a data scientist. Packed with new material on deep learning, statistics, and natural language processing, this updated book shows you how to find the gems in today’s messy glut of data.

Get a crash course in Python
Learn the basics of linear algebra, statistics, and probability—and how and when they’re used in data science
Collect, explore, clean, munge, and manipulate data
Dive into the fundamentals of machine learning
Implement models such as k-nearest neighbors, Naïve Bayes, linear and logistic regression, decision trees, neural networks, and clustering
Explore recommender systems, natural language processing, network analysis, MapReduce, and databases
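In the from-scratch spirit the book describes, a k-nearest-neighbors classifier fits in a few lines of plain Python. This is an illustrative sketch, not code from the book; the function names and toy data are mine:

```python
from collections import Counter
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(k, labeled_points, new_point):
    """Classify new_point by majority vote among its k nearest neighbors.

    labeled_points is a list of (features, label) pairs.
    """
    by_distance = sorted(
        labeled_points,
        key=lambda pair: euclidean_distance(pair[0], new_point),
    )
    k_nearest_labels = [label for _, label in by_distance[:k]]
    winner, _ = Counter(k_nearest_labels).most_common(1)[0]
    return winner

# Toy dataset: two well-separated clusters in 2-D
points = [
    ((1.0, 1.0), "red"), ((1.2, 0.8), "red"), ((0.8, 1.1), "red"),
    ((5.0, 5.0), "blue"), ((5.2, 4.9), "blue"), ((4.8, 5.1), "blue"),
]
print(knn_classify(3, points, (1.1, 0.9)))  # a point near the red cluster
```

Implementing the voting and distance logic yourself, rather than calling a library, is exactly the kind of exercise the book uses to make the underlying ideas concrete.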

Data Science Projects with Python

Data Science Projects with Python introduces you to data science and machine learning using Python through practical examples. In this book, you'll learn to analyze, visualize, and model data, applying techniques like logistic regression and random forests. With a case-study method, you'll build confidence implementing insights in real-world scenarios.

What this Book will help me do
Set up a data science environment with necessary Python libraries such as pandas and scikit-learn.
Effectively visualize data insights through Matplotlib and summary statistics.
Apply machine learning models including logistic regression and random forests to solve data problems.
Identify optimal models through evaluation metrics like k-fold cross-validation.
Develop confidence in data preparation and modeling techniques for real-world data challenges.

Author(s)
Stephen Klosterman is a seasoned data scientist with a keen interest in practical applications of machine learning. He combines a strong academic foundation with real-world experience to craft relatable content. Stephen excels in breaking down complex topics into approachable lessons, helping learners grow their data science expertise step by step.

Who is it for?
This book is ideal for data analysts, scientists, and business professionals looking to enhance their skills in Python and data science. If you have some experience in Python and a foundational understanding of algebra and statistics, you'll find this book approachable. It offers an excellent gateway to mastering advanced data analysis techniques. Whether you're seeking to explore machine learning or apply data insights, this book supports your growth.
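The k-fold cross-validation mentioned above is easy to sketch without any library: split the sample indices into k folds and let each fold take a turn as the validation set. A minimal illustration in plain Python (not code from the book):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) pairs for k-fold CV.

    Each fold serves as the validation set exactly once; the remaining
    folds form the training set.
    """
    indices = list(range(n_samples))
    # Distribute any remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

for train, val in k_fold_indices(10, 5):
    print(val)  # each sample index appears in exactly one validation fold
```

In practice you would use scikit-learn's cross-validation utilities, but the index bookkeeping above is all that is really happening underneath.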

Data Science and Engineering at Enterprise Scale

As enterprise-scale data science sharpens its focus on data-driven decision making and machine learning, new tools have emerged to help facilitate these processes. This practical ebook shows data scientists and enterprise developers how the notebook interface, Apache Spark, and other collaboration tools are particularly well suited to bridge the communication gap between their teams. Through a series of real-world examples, author Jerome Nilmeier demonstrates how to generate a model that enables data scientists and developers to share ideas and project code. You’ll learn how data scientists can approach real-world business problems with Spark and how developers can then implement the solution in a production environment.

Dive deep into data science technologies, including Spark, TensorFlow, and the Jupyter Notebook
Learn how Spark and Python notebooks enable data scientists and developers to work together
Explore how the notebook environment works with Spark SQL for structured data
Use notebooks and Spark as a launchpad to pursue supervised, unsupervised, and deep learning data models
Learn additional Spark functionality, including graph analysis and streaming
Explore the use of analytics in the production environment, particularly when creating data pipelines and deploying code

Data Science Using Python and R

Learn data science by doing data science! Data Science Using Python and R will get you plugged into the world’s two most widespread open-source platforms for data science: Python and R. Data science is hot. Bloomberg called data scientist “the hottest job in America.” Python and R are the top two open-source data science tools in the world. In Data Science Using Python and R, you will learn step-by-step how to produce hands-on solutions to real-world business problems, using state-of-the-art techniques. Data Science Using Python and R is written for the general reader with no previous analytics or programming experience. An entire chapter is dedicated to learning the basics of Python and R. Then, each chapter presents step-by-step instructions and walkthroughs for solving data science problems using Python and R. Those with analytics experience will appreciate having a one-stop shop for learning how to do data science using Python and R. Topics covered include data preparation, exploratory data analysis, preparing to model the data, decision trees, model evaluation, misclassification costs, naïve Bayes classification, neural networks, clustering, regression modeling, dimension reduction, and association rules mining. Further, exciting new topics such as random forests and general linear models are also included. The book emphasizes data-driven error costs to enhance profitability, which avoids the common pitfalls that may cost a company millions of dollars. Data Science Using Python and R provides exercises at the end of every chapter, totaling over 500 exercises in the book. Readers will therefore have plenty of opportunity to test their newfound data science skills and expertise. In the Hands-on Analysis exercises, readers are challenged to solve interesting business problems using real-world data sets.

by Sean Law (TD Ameritrade), Hugo (DataCamp)

This week, Hugo speaks with Sean Law about data science research and development at TD Ameritrade. Sean’s work on the Exploration team uses cutting-edge theories and tools to build proofs of concept. At TD Ameritrade they think about a wide array of questions: conversational agents that help customers quickly get to the information they need, and going beyond chatbots; modern time series analysis and more advanced techniques like recurrent neural networks to predict the next time a customer might call and what they might be calling about; and helping investors leverage alternative data sets to make more informed decisions.

What does this proof-of-concept work on the edge of data science look like at TD Ameritrade, and how does it differ from building prototypes and products? How does exploration differ from production? Stick around to find out.

LINKS FROM THE SHOW

DATAFRAMED GUEST SUGGESTIONS

DataFramed Guest Suggestions (who do you want to hear on DataFramed?)

FROM THE INTERVIEW

Sean on Twitter
Sean's Website
TD Ameritrade Careers Page
PyData Ann Arbor Meetup
PyData Ann Arbor YouTube Channel (Videos)
TDA Github Account (Time Series Pattern Matching repo to be open sourced in the coming months)
Aura Shows Human Fingerprint on Global Air Quality

FROM THE SEGMENTS

Guidelines for A/B Testing (with Emily Robinson ~19:20)

Guidelines for A/B Testing (By Emily Robinson)
10 Guidelines for A/B Testing Slides (By Emily Robinson)
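As a rough illustration of the statistics behind such guidelines, a two-proportion z-test is a common way to compare conversion rates between the A and B arms of an experiment. A plain-Python sketch (the conversion numbers below are invented for illustration):

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z statistic for the difference between two conversion rates,
    using the pooled standard error."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: 5.0% vs 6.2% conversion, 10,000 users per arm
z = two_proportion_z(500, 10_000, 620, 10_000)
print(round(z, 2))  # well above 1.96, so significant at the 5% level
```

A |z| above roughly 1.96 corresponds to p < 0.05 under the normal approximation; the guidelines linked above cover the harder parts, like deciding sample sizes up front and not peeking early.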

Data Science Best Practices (with Ben Skrainka ~34:50)

Debugging (By David J. Agans)
Basic Debugging With GDB (By Ben Skrainka)
Sneaky Bugs and How to Find Them (with git bisect) (By Wiktor Czajkowski)
Good logging practice in Python (By Victor Lin)
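The Python logging guidance linked above boils down to a couple of habits that are easy to show: use a module-level logger named after the module, and log exceptions with their traceback instead of swallowing them. A minimal sketch (the function and messages are hypothetical, not from the linked article):

```python
import logging

# Module-level logger named after the module, so log output shows
# exactly where a message came from.
logger = logging.getLogger(__name__)

def parse_record(raw):
    """Parse one record, logging (with traceback) anything unparseable."""
    try:
        return int(raw)
    except ValueError:
        # logger.exception records the message AND the current traceback
        logger.exception("could not parse record %r", raw)
        return None

if __name__ == "__main__":
    # Configure handlers once, at the application entry point,
    # never inside library code.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    print(parse_record("42"), parse_record("oops"))
```

Keeping configuration at the entry point and loggers at module level means libraries stay silent by default and applications decide what to show.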

Original music and sounds by The Sticks.

Data Science for Marketing Analytics

Data Science for Marketing Analytics introduces you to leveraging state-of-the-art data science techniques to optimize marketing outcomes. You'll learn how to manipulate and analyze data using Python, create customer segments, and apply machine learning algorithms to predict customer behavior. This book provides a comprehensive, hands-on approach to marketing analytics.

What this Book will help me do
Learn to use Python libraries like pandas & Matplotlib for data analysis.
Understand clustering techniques to create meaningful customer segments.
Implement linear regression for predicting customer lifetime value.
Explore classification algorithms to model customer preferences.
Develop skills to build interactive dashboards for marketing reports.

Author(s)
Tommy Blanchard, Debasish Behera, and Pranshu Bhatnagar are experienced professionals in data science and marketing analytics, with extensive backgrounds in applying machine learning to real-world business applications. They bring a wealth of knowledge and an approachable teaching style to this book, focusing on practical, industry-relevant applications for learners.

Who is it for?
This book is for developers and marketing professionals looking to advance their analytics skills. It is ideal for individuals with a basic understanding of Python and mathematics who want to explore predictive modeling and segmentation strategies. Readers should have a curiosity for data-driven problem-solving in marketing contexts to benefit most from the content.
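The linear-regression-for-lifetime-value idea can be sketched with the closed-form least-squares fit, no library required. The spend/value numbers below are invented for illustration, not data from the book:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x via the normal equations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y over variance of x
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x  # intercept so the line passes through the means
    return a, b

# Hypothetical data: first-month spend vs. 12-month customer value
spend = [10, 20, 30, 40, 50]
value = [120, 190, 310, 405, 480]
a, b = fit_line(spend, value)
print(f"predicted value ≈ {a:.1f} + {b:.2f} * spend")
```

In the book's workflow you would reach for scikit-learn instead, but the fitted coefficients are the same quantities this two-line formula produces.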

Hands-On Data Science for Marketing

The book "Hands-On Data Science for Marketing" equips readers with the tools and insights to optimize their marketing campaigns using data science and machine learning techniques. Using practical examples in Python and R, you will learn how to analyze data, predict customer behavior, and implement effective strategies for better customer engagement and retention.

What this Book will help me do
Understand marketing KPIs and learn to compute and visualize them in Python and R.
Develop the ability to analyze customer behavior and predict potential high-value customers.
Master machine learning concepts for customer segmentation and personalized marketing strategies.
Improve your skills to forecast customer engagement and lifetime value for more effective planning.
Learn the techniques of A/B testing and their application in refining marketing decisions.

Author(s)
Yoon Hyup Hwang is a seasoned data scientist with a deep interest in the intersection of marketing and technology. With years of expertise in implementing machine learning algorithms in marketing analytics, Yoon brings a unique perspective by blending technical insights with business strategy. As an educator and practitioner, Yoon's approachable style and clear explanations make complex topics accessible for all learners.

Who is it for?
This book is tailored for marketing professionals looking to enhance their strategies using data science, data enthusiasts eager to apply their skills in marketing, and students or engineers seeking to expand their knowledge in this domain. A basic understanding of Python or R is beneficial, but the book is structured to welcome beginners by covering foundational to advanced concepts in a practical way.

Mastering Geospatial Development with QGIS 3.x - Third Edition

This book, "Mastering Geospatial Development with QGIS 3.x", is your comprehensive guide to becoming skilled in QGIS, an open-source GIS software. Covering the functionalities of QGIS 3.4 and 3.6, you will advance your knowledge in spatial data analysis, styling, and spatial database management through practical examples and in-depth discussions.

What this Book will help me do
Understand the latest features and updates in QGIS 3.6.
Master spatial data styling for impactful geographic visualizations.
Learn to create and manage spatial databases and GeoPackages.
Automate workflows using QGIS's graphical modeler and Python scripting.
Develop custom QGIS plugins to extend its capabilities.

Author(s)
This book is written by a team of GIS experts with extensive experience in spatial data analysis and QGIS. Authors include professionals with GISP credentials who have taught GIS at various levels. With their deep understanding of QGIS and practical teaching approach, they aim to make premium GIS knowledge accessible to all.

Who is it for?
The book is ideal for GIS professionals seeking to enhance their QGIS expertise. Beginners looking to establish a firm foundation in GIS and QGIS will also benefit. Developers interested in extending QGIS capabilities using Python will find invaluable guidance here. Whether for career growth, project management, or academic purposes, this book suits users aspiring to excel in geospatial development.

PySpark SQL Recipes: With HiveQL, Dataframe and Graphframes

Carry out data analysis with PySpark SQL, graphframes, and graph data processing using a problem-solution approach. This book provides solutions to problems related to dataframes, data manipulation, summarization, and exploratory analysis. You will improve your skills in graph data analysis using graphframes and see how to optimize your PySpark SQL code.

PySpark SQL Recipes starts with recipes on creating dataframes from different types of data source, data aggregation and summarization, and exploratory data analysis using PySpark SQL. You’ll also discover how to solve problems in graph analysis using graphframes. On completing this book, you’ll have ready-made code for all your PySpark SQL tasks, including creating dataframes using data from different file formats as well as from SQL or NoSQL databases.

What You Will Learn
Understand PySpark SQL and its advanced features
Use SQL and HiveQL with PySpark SQL
Work with structured streaming
Optimize PySpark SQL
Master graphframes and graph processing

Who This Book Is For
Data scientists, Python programmers, and SQL programmers.

Hands-On Business Intelligence with Qlik Sense

"Hands-On Business Intelligence with Qlik Sense" teaches you how to harness the powerful capabilities of Qlik Sense to build dynamic, interactive dashboards and analyze data effectively. This book provides comprehensive guidance, from data modeling to creating visualizations, geospatial analysis, forecasting, and sharing insights across your organization.

What this Book will help me do
Understand the core concepts of Qlik Sense for building business intelligence dashboards.
Master the process of loading, reshaping, and modeling data for analysis and reporting.
Create impactful visual representations of data using Qlik Sense visualization tools.
Leverage advanced analytics techniques, including Python and R integration, for deeper insights.
Utilize Qlik Sense GeoAnalytics to perform geospatial analysis and produce location-based insights.

Author(s)
The authors of "Hands-On Business Intelligence with Qlik Sense" are experts in Qlik Sense and data analysis. They collectively bring decades of experience in business intelligence development and implementation. Their practical approach ensures that readers not only learn the theory but can also apply the techniques in real-world scenarios.

Who is it for?
This book is designed for business intelligence developers, data analysts, and anyone interested in exploring Qlik Sense for their data analysis tasks. If you're aiming to start with Qlik Sense and want a practical and hands-on guide, this book is ideal. No prior experience with Qlik Sense is necessary, but familiarity with data analysis concepts is helpful.

Mastering Tableau 2019.1 - Second Edition

Mastering Tableau 2019.1 is your essential guide for becoming an expert in Tableau's advanced features and functionalities. This book will teach you how to use Tableau Prep for data preparation, create complex visualizations and dashboards, and leverage Tableau's integration with R, Python, and MATLAB. You'll be equipped with the skills to solve both common and advanced BI challenges.

What this Book will help me do
Gain expertise in preparing and blending data using Tableau Prep and other data handling tools.
Create advanced data visualizations and designs that effectively communicate insights.
Implement narrative storytelling in BI with advanced presentation designs in Tableau.
Integrate Tableau with programming tools like R, Python, and MATLAB for extended functionalities.
Optimize performance and improve dashboard interactivity for user-friendly analytics solutions.

Author(s)
Marleen Meier, with extensive experience in business intelligence and analytics, and David Baldwin, an expert in data visualization, collaboratively bring this advanced Tableau guide to life. Their passion for empowering users with practical BI solutions is reflected in the hands-on approach employed throughout the book.

Who is it for?
This book is perfectly suited for business analysts, BI professionals, and data analysts who already have foundational knowledge of Tableau and seek to advance their skills for tackling more complex BI challenges. It's ideal for individuals aiming to master Tableau's premium features for impactful analytics solutions.

Python for Data Science For Dummies, 2nd Edition

The fast and easy way to learn Python programming and statistics. Python is a general-purpose programming language created in the late 1980s—and named after Monty Python—that's used by thousands of people to do things from testing microchips at Intel, to powering Instagram, to building video games with the PyGame library. Python For Data Science For Dummies is written for people who are new to data analysis, and discusses the basics of Python data analysis programming and statistics. The book also discusses Google Colab, which makes it possible to write Python code in the cloud.

Get started with data science and Python
Visualize information
Wrangle data
Learn from data

The book provides the statistical background needed to get started in data science programming, including probability, random distributions, hypothesis testing, confidence intervals, and building regression models for prediction.
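The confidence-interval material described above amounts to only a few lines of standard-library Python. A sketch using the normal approximation (the sample data is made up, and for small samples a t-based interval would be more appropriate):

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the mean,
    using the normal approximation z * (s / sqrt(n))."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - z * se, m + z * se

# Hypothetical measurements, e.g. fill weights in grams
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 12.0]
low, high = mean_confidence_interval(data)
print(f"mean is likely between {low:.2f} and {high:.2f}")
```

The interval is centered on the sample mean and shrinks as the square root of the sample size grows, which is the core intuition the statistics chapters build on.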

Summary Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes which require domain specific knowledge, and how the information contained in connections that they expose is being used for interesting projects.

Introduction

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Your host is Tobias Macey and today I’m interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data.

Interview

Introduction

How did you get involved in the area of data management?

I did some database and GIS work for my dissertation in archaeology, back in the late 1990s. I got frustrated at the lack of comparative data, and I got frustrated at all the work I put into creating data that nobody would likely use. So I decided to focus my energies on research data management.

Can you start by describing what Open Context is and how it started?

Open Context is an open access data publishing service for archaeology. It started because we need better ways of disseminating structured data and digital media than is possible with conventional articles, books and reports.

What are your protocols for determining which data sets you will work with?

Datasets need to come from research projects that meet the normal standards of professional conduct (laws, ethics, professional norms) articulated by archaeology’s professional societies.

What are some of the challenges unique to research data?

What are some of the unique requirements for processing, publishing, and archiving research data?

You have to work on a shoe-string budget, essentially providing "public goods". Archaeologists typically don’t have much discretionary money available, and publishing and archiving data are not yet very common practices.

Another issue is that it will take a long time to publish enough data to power many "meta-analyses" that draw upon many datasets. The problem is that lots of archaeological data describes very particular places and times. Because datasets can be so particularistic, finding data relevant to your interests can be hard. So we face a monumental task in supplying enough data to satisfy many, many particularistic interests.

How much education is necessary around your content licensing for researchers who are interested in publishing their data with you?

We require use of Creative Commons licenses, and greatly encourage the CC-BY license or CC-Zero (public domain) to try to keep things simple and easy to understand.

Can you describe the system architecture that you use for Open Context?

Open Context is a Django Python application with a Postgres database and an Apache Solr index. It runs on Google Cloud services on Debian Linux.
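As a rough illustration of that stack, here is a minimal Django settings fragment wiring a Postgres database and pointing at a Solr endpoint. The database name, host, port, and Solr core here are invented for the example; this is a sketch of how such a stack is typically configured, not Open Context's actual settings.

```python
# Hypothetical Django settings fragment: a Postgres backend for the
# application's relational data, plus the URL of a Solr core that the
# app would query (e.g. via a client library such as pysolr) for search.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "opencontext",   # illustrative database name
        "HOST": "127.0.0.1",
        "PORT": "5432",
    }
}

# Solr is reached over HTTP; the path identifies the core to query.
SOLR_URL = "http://localhost:8983/solr/opencontext"
```

Keeping the relational store and the search index separate like this lets Postgres remain the system of record while Solr serves fast faceted queries.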


Apache Spark Quick Start Guide

Dive into the world of scalable data processing with the "Apache Spark Quick Start Guide." This book offers a foundational introduction to Spark, empowering readers to harness its capabilities for big data processing. With clear explanations and hands-on examples, you'll learn to implement Spark applications that handle complex data tasks efficiently.

What this Book will help me do

Understand and implement Spark's RDD and DataFrame APIs to process large datasets effectively. Set up a local development environment for Spark-based projects. Develop skills to debug and optimize slow-performing Spark applications. Harness Spark's built-in modules for SQL, streaming, and machine learning applications. Adopt best practices and optimization techniques for high-performance Spark applications.

Author(s)

Shrey Mehrotra is a seasoned software developer with expertise in big data technologies, particularly Apache Spark. With years of hands-on industry experience, Shrey focuses on making complex technical concepts accessible to all. Through his writing, he aims to share clear, practical guidance for developers of all levels.

Who is it for?

This guide is perfect for big data enthusiasts and professionals looking to learn Apache Spark's capabilities from scratch. It's aimed at data engineers interested in optimizing application performance and data scientists wanting to integrate machine learning with Spark. A basic familiarity with Scala, Python, or Java is recommended.

Learning PostgreSQL 11 - Third Edition

Immerse yourself in the capabilities of PostgreSQL 11 with this comprehensive beginner's guide. Learning PostgreSQL 11 will take you through relational database fundamentals and advanced database functionality, empowering you to build efficient and scalable database solutions with confidence. By the end of this book, you'll have mastery over PostgreSQL's features to develop, manage, and optimize your own databases.

What this Book will help me do

Gain a solid understanding of relational database principles and the PostgreSQL ecosystem. Learn to install PostgreSQL, create a database, and design a data model effectively. Develop skills to create, manipulate, and optimize tables, views, and efficient indexes. Utilize server-side programming with PL/pgSQL and advanced data types like JSONB. Enhance database reliability and performance, and connect to your Python applications seamlessly.

Author(s)

Christopher Travers and Volkov bring their collective expertise and practical experience to this book. Christopher has a strong background in software development and database systems, with years of hands-on involvement with PostgreSQL. Volkov has contributed significantly to innovative database solutions, emphasizing clear and actionable instructions. Together, they aim to demystify PostgreSQL for learners of all backgrounds.

Who is it for?

This book is crafted for developers, database administrators, and tech enthusiasts who want to delve into PostgreSQL. Beginners with no prior database experience will find its approach accessible, while those aiming to enhance their skills with PostgreSQL's latest features will benefit immensely. It's ideal for anyone seeking to build solid database or data warehousing applications with modern capabilities and best practices.