Python

QGIS: Becoming a GIS Power User

2017-02-28 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Alex Mandel , Víctor Olaya Ferrero , Ben Mearns , Alexander Bruy , Anita Graser (Austrian Institute of Technology)

Data Management GIS Master Data Management data data-engineering geographic-information-system-gis geographic information system (gis) location-data

Master data management, visualization, and spatial analysis techniques in QGIS and become a GIS power user. About This Book: Learn how to work with various types of data and create beautiful maps using this easy-to-follow guide. Give a touch of professionalism to your maps, both for functionality and look and feel, with the help of this practical guide. This progressive, hands-on guide builds on geospatial data and adds more reactive maps using geometry tools. Who This Book Is For: If you are a user, developer, or consultant and want to know how to use QGIS to achieve the results you are used to from other types of GIS, then this learning path is for you. You are expected to be comfortable with core GIS concepts. This Learning Path will make you an expert with QGIS by showing you how to develop more complex, layered map applications. It will launch you to the next level of GIS users. What You Will Learn: Create your first map by styling both vector and raster layers from different data sources. Use parameters such as precipitation, relative humidity, and temperature to predict the vulnerability of fields and crops to mildew. Re-project vector and raster data and see how to convert between different style formats. Use a mix of web services to provide a collaborative data system. Use raster analysis and a model automation tool to model the physical conditions for hydrological analysis. Get the most out of the cartographic tools in QGIS to reveal the advanced tips and tricks of cartography. In Detail: The first module, Learning QGIS, Third Edition, covers the installation and configuration of QGIS. You'll become a master in data creation and editing, and creating great maps. By the end of this module, you'll be able to extend QGIS with Python, getting in-depth with developing custom tools for the Processing Toolbox. The second module, QGIS Blueprints, gives you an overview of the application types and the technical aspects along with a few examples from the digital humanities. 
After estimating unknown values using interpolation methods and demonstrating visualization and analytical techniques, the module ends by creating an editable and data-rich map for the discovery of community information. The third module, QGIS 2 Cookbook, covers data input and output with special instructions for trickier formats. Later, we dive into exploring data, data management, and preprocessing steps to cut your data to just the important areas. At the end of this module, you will dive into the methods for analyzing routes and networks, and learn how to take QGIS beyond the out-of-the-box features with plug-ins, customization, and add-on tools. This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products: Learning QGIS, Third Edition by Anita Graser; QGIS Blueprints by Ben Mearns; and QGIS 2 Cookbook by Alex Mandel, Víctor Olaya Ferrero, Anita Graser, and Alexander Bruy. Style and approach: This Learning Path will get you up and running with QGIS. We start off with an introduction to QGIS and create maps and plugins. Then, we will guide you through Blueprints for geographic web applications, each of which will teach you a different feature by boiling down a complex workflow into steps you can follow. Finally, you'll turn your attention to becoming a QGIS power user and master data management, visualization, and spatial analysis techniques of QGIS.

The Data Science Handbook

2017-02-28 · O'Reilly Data Science Books O'Reilly Amazon

book

by Field Cady

AI/ML Analytics Big Data Computer Science Data Science data data-science

A comprehensive overview of data science covering the analytics, programming, and business skills necessary to master the discipline Finding a good data scientist has been likened to hunting for a unicorn: the required combination of technical skills is simply very hard to find in one person. In addition, good data science is not just rote application of trainable skill sets; it requires the ability to think flexibly about all these areas and understand the connections between them. This book provides a crash course in data science, combining all the necessary skills into a unified discipline. Unlike many analytics books, computer science and software engineering are given extensive coverage since they play such a central role in the daily work of a data scientist. The author also describes classic machine learning algorithms, from their mathematical foundations to real-world applications. Visualization tools are reviewed, and their central importance in data science is highlighted. Classical statistics is addressed to help readers think critically about the interpretation of data and its common pitfalls. The clear communication of technical results, which is perhaps the most undertrained of data science skills, is given its own chapter, and all topics are explained in the context of solving real-world data problems. 
The book also features: • Extensive sample code and tutorials using Python™ along with its technical libraries • Core technologies of “Big Data,” including their strengths and limitations and how they can be used to solve real-world problems • Coverage of the practical realities of the tools, keeping theory to a minimum; however, when theory is presented, it is done in an intuitive way to encourage critical thinking and creativity • A wide variety of case studies from industry • Practical advice on the realities of being a data scientist today, including the overall workflow, where time is spent, the types of datasets worked on, and the skill sets needed The Data Science Handbook is an ideal resource for data analysis methodology and big data software tools. The book is appropriate for people who want to practice data science, but lack the required skill sets. This includes software professionals who need to better understand analytics and statisticians who need to understand software. Modern data science is a unified discipline, and it is presented as such. This book is also an appropriate reference for researchers and entry-level graduate students who need to learn real-world analytics and expand their skill set. FIELD CADY is the data scientist at the Allen Institute for Artificial Intelligence, where he develops tools that use machine learning to mine scientific literature. He has also worked at Google and several Big Data startups. He has a BS in physics and math from Stanford University, and an MS in computer science from Carnegie Mellon.

Learning PySpark

2017-02-27 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Denny Lee (Databricks) , Tomasz Drabas

AI/ML Big Data Cloud Computing Data Engineering PySpark Spark Data Streaming apache-spark data data-engineering

"Learning PySpark" guides you through mastering the integration of Python with Apache Spark to build scalable and efficient data applications. You'll delve into Spark 2.0's architecture, efficiently process data, and explore PySpark's capabilities ranging from machine learning to structured streaming. By the end, you'll be equipped to craft and deploy robust data pipelines and applications. What this Book will help me do Master the Spark 2.0 architecture and its Python integration with PySpark. Leverage PySpark DataFrames and RDDs for effective data manipulation and analysis. Develop scalable machine learning models using PySpark's ML and MLlib libraries. Understand advanced PySpark features such as GraphFrames for graph processing and TensorFrames for deep learning models. Gain expertise in deploying PySpark applications locally and on the cloud for production-ready solutions. Author(s) Authors None Drabas and None Lee bring extensive experience in data engineering and Python programming. They combine a practical, example-driven approach with deep insights into Apache Spark's ecosystem. Their expertise and clarity in writing make this book accessible for individuals aiming to excel in big data technologies with Python. Who is it for? This book is best suited for Python developers who want to integrate Apache Spark 2.0 into their workflow to process large-scale data. Ideal readers will have foundational knowledge of Python and seek to build scalable data-intensive applications using Spark, regardless of prior experience with Spark itself.

Dask with Matthew Rocklin - Episode 2

2017-01-22 · Data Engineering Podcast Listen

podcast_episode

by Matthew Rocklin , Tobias Macey

Airflow Analytics Big Data Data Analytics Data Engineering GitHub Hadoop Kubernetes Luigi NumPy Pandas Spark

Summary

There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task oriented workflow tool and an in memory processing framework, and how it brings the power of Python to bear on the problem of big data.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey, and today I’m interviewing Matthew Rocklin about Dask and the Blaze ecosystem.

Interview with Matthew Rocklin

Introduction How did you get involved in the area of data engineering? Dask began its life as part of the Blaze project. Can you start by describing what Dask is and how it originated? There are a vast number of tools in the field of data analytics. What are some of the specific use cases that Dask was built for that weren’t able to be solved by the existing options? One of the compelling features of Dask is the fact that it is a Python library that allows for distributed computation at a scale that has largely been the exclusive domain of tools in the Hadoop ecosystem. Why do you think that the JVM has been the reigning platform in the data analytics space for so long? Do you consider Dask, along with the larger Blaze ecosystem, to be a competitor to the Hadoop ecosystem, either now or in the future? Are you seeing many Hadoop or Spark solutions being migrated to Dask? If so, what are the common reasons? There is a strong focus for using Dask as a tool for interactive exploration of data. How does it compare to something like Apache Drill? For anyone looking to integrate Dask into an existing code base that is already using NumPy or Pandas, what does that process look like? How do the task graph capabilities compare to something like Airflow or Luigi? Looking through the documentation for the graph specification in Dask, it appears that there is the potential to introduce cycles or other bugs into a large or complex task chain. Is there any built-in tooling to check for that before submitting the graph for execution? What are some of the most interesting or unexpected projects that you have seen Dask used for? What do you perceive as being the most relevant aspects of Dask for data engineering/data infrastructure practitioners, as compared to the end users of the systems that they support? What are some of the most significant problems that you have been faced with, and which still need to be overcome in the Dask project? 
I know that the work on Dask is largely performed under the umbrella of PyData and sponsored by Continuum Analytics. What are your thoughts on the financial landscape for open source data analytics and distributed computation frameworks as compared to the broader world of open source projects?
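
One of the questions above asks whether a task graph can be checked for cycles before it is submitted for execution. As a rough illustration only, and not a description of Dask's own tooling, the sketch below assumes a simplified version of the graph format: a plain dict that maps keys either to literal values or to tuples of the form (function, arg, ...), where any argument that is also a key counts as a dependency. The helper names find_dependencies and find_cycle are hypothetical, and nested arguments and key aliases are deliberately ignored.

```python
from operator import add


def find_dependencies(graph, task):
    """Return the graph keys that a task tuple refers to directly.

    Simplifying assumption: only top-level arguments of a
    (callable, arg, ...) tuple are inspected.
    """
    if not (isinstance(task, tuple) and task and callable(task[0])):
        return []  # a literal value has no dependencies
    deps = []
    for arg in task[1:]:
        try:
            if arg in graph:
                deps.append(arg)
        except TypeError:
            pass  # unhashable arguments (e.g. lists) cannot be keys
    return deps


def find_cycle(graph):
    """Return one cycle as a list of keys, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {key: WHITE for key in graph}

    def visit(key, path):
        state[key] = GREY
        path.append(key)
        for dep in find_dependencies(graph, graph[key]):
            if state[dep] == GREY:
                # dep is already on the current path, so we closed a loop
                return path[path.index(dep):] + [dep]
            if state[dep] == WHITE:
                cycle = visit(dep, path)
                if cycle:
                    return cycle
        path.pop()
        state[key] = BLACK
        return None

    for key in graph:
        if state[key] == WHITE:
            cycle = visit(key, [])
            if cycle:
                return cycle
    return None


# Hypothetical example graphs
acyclic = {"x": 1, "y": (add, "x", 10), "z": (add, "y", "x")}
cyclic = {"a": (add, "b", 1), "b": (add, "a", 1)}

print(find_cycle(acyclic))
print(find_cycle(cyclic))
```

The depth-first search marks keys that are on the current path, so a dependency that points back to such a key reveals a cycle. The recursion is fine for small graphs; a very deep task chain would need an iterative version.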

Keep in touch

@mrocklin on Twitter, mrocklin on GitHub

Links

http://matthewrocklin.com/blog/work/2016/09/22/cluster-deployments?utm_source=rss&utm_medium=rss https://opendatascience.com/blog/dask-for-institutions/?utm_source=rss&utm_medium=rss Continuum Analytics 2sigma X-Array Tornado

Website Podcast Interview

Airflow Luigi Mesos Kubernetes Spark Dryad Yarn Read The Docs XData

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Practical Business Intelligence

2016-12-21 · O'Reilly Business Intelligence Books O'Reilly Amazon

book

by Ahmed Sherif

Analytics BI Data Analytics Data Management JavaScript Qlik SQL Tableau business-intelligence data data-science

Master the art of business intelligence in just a few steps with this hands-on guide. By following the detailed examples and techniques in this book, you'll learn to create effective BI solutions that analyze data for strategic decision-making. You'll explore tools like D3.js, R, Tableau, QlikView, and Python to visualize data and gain actionable insights. What this Book will help me do Develop the ability to create self-service reporting environments for business analytics. Understand and apply SQL techniques to aggregate and manipulate data effectively. Design and implement data models suitable for analytical and reporting purposes. Connect data warehouses with advanced BI tools to streamline reporting processes. Analyze and visualize data using industry-leading tools like D3.js, R, Tableau, and Python. Author(s) Written by seasoned experts in data analytics and business intelligence, the authors bring years of industry experience and practical insights to this well-rounded guide. They specialize in turning complex data into manageable, insightful BI solutions. Their writing style is approachable yet detailed, ensuring you gain both foundational and advanced knowledge in a structured way. Who is it for? This book caters to data enthusiasts and professionals in roles such as data analysis, BI development, or data management. It's perfect for beginners seeking practical BI skills, as well as experienced developers looking to integrate and implement sophisticated BI tools. The focus is on actionable insights, making it ideal for anyone aiming to leverage data for business growth.

Principles of Data Science

2016-12-16 · O'Reilly Data Science Books O'Reilly Amazon

book

by Sinan Ozdemir (LoopGenius)

AI/ML Analytics Data Science data data-science

If you've ever wondered how to bridge the gap between mathematics, programming, and actionable data insights, 'Principles of Data Science' is the guide for you. This book explores the full data science pipeline, providing you with tools and knowledge to transform raw data into impactful decisions. With practical lessons and hands-on tutorials, you'll master the essential skills of a data scientist. What this Book will help me do Understand and apply the five core steps of the data science process. Gain insight into data cleaning, visualization, and effective communication of results. Learn and implement foundational machine learning models using Python or R. Bridge gaps between mathematics, statistics, and programming to solve data-driven problems. Evaluate machine learning models using key metrics for better predictive capabilities. Author(s) The author, a seasoned data scientist with years of professional experience in analytics and software development, brings a rich perspective to the topic. Combining a strong foundation in mathematics with expertise in Python and R, they have worked on diverse real-world data projects. Their teaching philosophy emphasizes clarity and practical application, ensuring you not only gain knowledge but also know how to apply it effectively. Who is it for? This book is intended for individuals with a basic understanding of algebra and some programming experience in Python or R. It is perfect for programmers who wish to dive into the world of data science or for those with math skills looking to apply them practically. If you seek to turn raw data into valuable insights and predictions, this book is tailored for you.

Python Data Science Handbook

2016-11-21 · O'Reilly Data Science Books O'Reilly Amazon

book

by Jake VanderPlas

AI/ML Data Science Matplotlib NumPy Pandas Scikit-learn programming-languages software-development

For many researchers, Python is a first-class tool mainly because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the Python Data Science Handbook do you get them all—IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and other related tools. Working scientists and data crunchers familiar with reading and writing Python code will find this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python. With this handbook, you’ll learn how to use: IPython and Jupyter: provide computational environments for data scientists using Python NumPy: includes the ndarray for efficient storage and manipulation of dense data arrays in Python Pandas: features the DataFrame for efficient storage and manipulation of labeled/columnar data in Python Matplotlib: includes capabilities for a flexible range of data visualizations in Python Scikit-Learn: for efficient and clean Python implementations of the most important and established machine learning algorithms

Programming Pig, 2nd Edition

2016-11-09 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Alan Gates , Daniel Dai

Data Modelling Hadoop HDFS data data-engineering pig

For many organizations, Hadoop is the first step for dealing with massive amounts of data. The next step? Processing and analyzing datasets with the Apache Pig scripting platform. With Pig, you can batch-process data without having to create a full-fledged application, making it easy to experiment with new datasets. Updated with use cases and programming examples, this second edition is the ideal learning tool for new and experienced users alike. You’ll find comprehensive coverage on key features such as the Pig Latin scripting language and the Grunt shell. When you need to analyze terabytes of data, this book shows you how to do it efficiently with Pig. Delve into Pig’s data model, including scalar and complex data types Write Pig Latin scripts to sort, group, join, project, and filter your data Use Grunt to work with the Hadoop Distributed File System (HDFS) Build complex data processing pipelines with Pig’s macros and modularity features Embed Pig Latin in Python for iterative processing and other advanced tasks Use Pig with Apache Tez to build high-performance batch and interactive data processing applications Create your own load and store functions to handle data formats and storage mechanisms

Spark in Action

2016-11-03 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Marko Bonaci , Petar Zecevic

AI/ML Analytics API Big Data DevOps Docker Java Scala Spark SQL Data Streaming Virtual Machine +3 more

Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0. About the Technology Big data systems distribute datasets across clusters of machines, making it a challenge to efficiently query, stream, and interpret them. Spark can help. It is a processing system designed specifically for distributed data. It provides easy-to-use interfaces, along with the performance you need for production-quality analytics and machine learning. Spark 2 also adds improved programming APIs, better performance, and countless other upgrades. About the Book Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. You'll get comfortable with the Spark CLI as you work through a few introductory examples. Then, you'll start programming Spark using its core APIs. Along the way, you'll work with structured data using Spark SQL, process near-real-time streaming data, apply machine learning algorithms, and munge graph data using Spark GraphX. For a zero-effort startup, you can download the preconfigured virtual machine ready for you to try the book's code. What's Inside Updated for Spark 2.0 Real-life case studies Spark DevOps with Docker Examples in Scala, and online in Java and Python About the Reader Written for experienced programmers with some background in big data or machine learning. About the Authors Petar Zečević and Marko Bonaći are seasoned developers heavily involved in the Spark community. Quotes Dig in and get your hands dirty with one of the hottest data processing engines today. A great guide. - Jonathan Sharley, Pandora Media Must-have! Speed up your learning of Spark as a distributed computing framework. - Robert Ormandi, Yahoo! An easy-to-follow, step-by-step guide. - Gaurav Bhardwaj, 3Pillar Global An ambitiously comprehensive overview of Spark and its diverse ecosystem. - Jonathan Miller, Optensity

Introduction to Machine Learning with Python

2016-10-11 · O'Reilly AI & ML Books O'Reilly Amazon

book

by Sarah Guido , Andreas C. Müller

AI/ML Data Science Matplotlib NumPy Scikit-learn ai-ml data machine-learning

Machine learning has become an integral part of many commercial applications and research projects, but this field is not exclusive to large companies with extensive research teams. If you use Python, even as a beginner, this book will teach you practical ways to build your own machine learning solutions. With all the data available today, machine learning applications are limited only by your imagination. You’ll learn the steps necessary to create a successful machine-learning application with Python and the scikit-learn library. Authors Andreas Müller and Sarah Guido focus on the practical aspects of using machine learning algorithms, rather than the math behind them. Familiarity with the NumPy and matplotlib libraries will help you get even more from this book. With this book, you’ll learn: Fundamental concepts and applications of machine learning Advantages and shortcomings of widely used machine learning algorithms How to represent data processed by machine learning, including which data aspects to focus on Advanced methods for model evaluation and parameter tuning The concept of pipelines for chaining models and encapsulating your workflow Methods for working with text data, including text-specific processing techniques Suggestions for improving your machine learning and data science skills

Mastering QGIS - Second Edition

2016-09-30 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Luigi Pirelli , Paolo Corti , Richard Smith Jr., GISP , Kurt Menke, GISP , John Van Hoesen, GISP

DataViz GIS data data-engineering geographic-information-system-gis geographic information system (gis) location-data

Dive into advanced GIS techniques with 'Mastering QGIS,' a comprehensive guide that teaches you how to leverage the full capabilities of the open-source GIS software QGIS. Through practical examples, you'll advance your skills from the fundamentals to professional levels by developing plugins, automating workflows, and mastering data visualization. What this Book will help me do Create comprehensive spatial databases to organize and analyze GIS data effectively. Master advanced styling techniques for professional-quality map presentation. Process vector and raster data, including preparing and analyzing data for specific use cases. Integrate Python scripting to automate GIS data workflows and extend QGIS functionality. Develop custom QGIS plugins to tailor the software to your projects and needs. Author(s) Kurt Menke, GISP, along with co-authors recognized as experts in GIS, share their extensive experience with QGIS. They bring a practical approach aimed at GIS professionals seeking deeper software mastery. Who is it for? This book is ideal for GIS professionals, students, and analysts intending to elevate their QGIS competency. Whether you're looking to switch from proprietary GIS tools or enhance your open-source skillset, this resource provides the expertise required to excel in your field.

Big Data Analytics

2016-09-28 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Aravind Nallan , Venkat Ankam

AI/ML Analytics Big Data Data Analytics DataViz Hadoop Apache HBase Kafka Scala Spark SQL Data Streaming +3 more

Dive into the world of big data with "Big Data Analytics: Real Time Analytics Using Apache Spark and Hadoop." This comprehensive guide introduces readers to the fundamentals and practical applications of Apache Spark and Hadoop, covering essential topics like Spark SQL, DataFrames, structured streaming, and more. Learn how to harness the power of real-time analytics and big data tools effectively. What this Book will help me do Master the key components of Apache Spark and Hadoop ecosystems, including Spark SQL and MapReduce. Gain an understanding of DataFrames, DataSets, and structured streaming for seamless data handling. Develop skills in real-time analytics using Spark Streaming and technologies like Kafka and HBase. Learn to implement machine learning models using Spark's MLlib and ML Pipelines. Explore graph analytics with GraphX and leverage data visualization tools like Jupyter and Zeppelin. Author(s) Venkat Ankam, an expert in big data technologies, has years of experience working with Apache Hadoop and Spark. As an educator and technical consultant, Venkat has enabled numerous professionals to gain critical insights into big data ecosystems. With a pragmatic approach, his writings aim to guide readers through complex systems in a structured and easy-to-follow manner. Who is it for? This book is perfect for data analysts, data scientists, software architects, and programmers aiming to expand their knowledge of big data analytics. Readers should ideally have a basic programming background in languages like Python, Scala, R, or SQL. Prior hands-on experience with big data environments is not necessary but is an added advantage. This guide is created to cater to a range of skill levels, from beginners to intermediate learners.

2016 Data Science Salary Survey

2016-09-16 · O'Reilly Data Science Books O'Reilly Amazon

book

by Roger Magoulas , John King

Data Science data data-science data-science-as-a-profession

In this fourth edition of O’Reilly’s Data Science Salary Survey, 983 respondents working across a variety of industries answered questions about the tools they use, the tasks they engage in, and the salaries they make. This year’s survey includes data scientists, engineers, and others in the data space from 45 countries and 45 US states. The 2016 survey included new questions, most notably about specific data-related tasks that may affect salary. Plug in your own data points to the survey model and see how you compare to other data science professionals in your industry. With this report, you’ll learn: Where data scientists make the highest salaries—by country and by US state Tools that respondents most commonly use on the job, and tools that contribute most to salary Two activities that contribute to higher earnings among respondents How gender and bargaining skills affect salaries when all other factors are equal Salary differences between those using open source tools vs those using proprietary tools Salary differences between those who rely on Python vs those who use several tools Participate in the 2017 Survey The survey is now open for the 2017 report. Spend just 5 to 10 minutes and take the anonymous salary survey here: https://www.oreilly.com/ideas/take-the-2017-data-science-salary-survey.

Music21

2016-09-02 · Data Skeptic Listen

podcast_episode

by Kyle Polich , Michael Cuthbert (Netflix)

AI/ML

Our guest today is Michael Cuthbert, an associate professor of music at MIT and principal investigator of the Music21 project, which is the focus of today's discussion. Music21 is a Python library that makes the analysis of music accessible and fun. It supports integration with popular formats such as MIDI, MusicXML, LilyPond, and others. It's also well integrated with The Elvis Project, enabling users to import large volumes of music for easy analysis. Music21 is a great platform for musicologists and machine learning researchers alike to explore patterns and structure in music.

Sams Teach Yourself Apache Spark™ in 24 Hours

2016-08-17 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Jeffrey Aven

AI/ML API Big Data Cassandra Cloud Computing Data Engineering Kafka NoSQL Scala Spark SQL Data Streaming +3 more

Apache Spark is a fast, scalable, and flexible open source distributed processing engine for big data systems and is one of the most active open source big data projects to date. In just 24 lessons of one hour or less, Sams Teach Yourself Apache Spark in 24 Hours helps you build practical Big Data solutions that leverage Spark’s amazing speed, scalability, simplicity, and versatility. This book’s straightforward, step-by-step approach shows you how to deploy, program, optimize, manage, integrate, and extend Spark–now, and for years to come. You’ll discover how to create powerful solutions encompassing cloud computing, real-time stream processing, machine learning, and more. Every lesson builds on what you’ve already learned, giving you a rock-solid foundation for real-world success. Whether you are a data analyst, data engineer, data scientist, or data steward, learning Spark will help you to advance your career or embark on a new career in the booming area of Big Data. Learn how to • Discover what Apache Spark does and how it fits into the Big Data landscape • Deploy and run Spark locally or in the cloud • Interact with Spark from the shell • Make the most of the Spark Cluster Architecture • Develop Spark applications with Scala and functional Python • Program with the Spark API, including transformations and actions • Apply practical data engineering/analysis approaches designed for Spark • Use Resilient Distributed Datasets (RDDs) for caching, persistence, and output • Optimize Spark solution performance • Use Spark with SQL (via Spark SQL) and with NoSQL (via Cassandra) • Leverage cutting-edge functional programming techniques • Extend Spark with streaming, R, and Sparkling Water • Start building Spark-based machine learning and graph-processing applications • Explore advanced messaging technologies, including Kafka • Preview and prepare for Spark’s next generation of innovations Instructions walk you through common questions, issues, and tasks; Q-and-As, 
Quizzes, and Exercises build and test your knowledge; "Did You Know?" tips offer insider advice and shortcuts; and "Watch Out!" alerts help you avoid pitfalls. By the time you're finished, you'll be comfortable using Apache Spark to solve a wide spectrum of Big Data problems.

Interactive Spark using PySpark

2016-08-15 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Benjamin Bengfort , Jenny Kim

AI/ML Analytics Bash Big Data Data Analytics DataViz Hadoop Java PySpark Scala Spark apache-spark +2 more

Apache Spark is an in-memory framework that allows data scientists to explore and interact with big data much more quickly than with Hadoop. Python users can work with Spark using an interactive shell called PySpark. Why is it important? PySpark makes the large-scale data processing capabilities of Apache Spark accessible to data scientists who are more familiar with Python than Scala or Java. This also allows for reuse of a wide variety of Python libraries for machine learning, data visualization, numerical analysis, etc. What you'll learn—and how you can apply it Compare the different components provided by Spark, and what use cases they fit. Learn how to use RDDs (resilient distributed datasets) with PySpark. Write Spark applications in Python and submit them to the cluster as Spark jobs. Get an introduction to the Spark computing framework. Apply this approach to a worked example to determine the most frequent airline delays in a specific month and year. This lesson is for you because… You're a data scientist, familiar with Python coding, who needs to get up and running with PySpark You're a Python developer who needs to leverage the distributed computing resources available on a Hadoop cluster, without learning Java or Scala first Prerequisites Familiarity with writing Python applications Some familiarity with bash command-line operations Basic understanding of how to use simple functional programming constructs in Python, such as closures, lambdas, maps, etc. Materials or downloads needed in advance Apache Spark This lesson is taken from Data Analytics with Hadoop by Jenny Kim and Benjamin Bengfort.

[MINI] Survival Analysis

2016-07-29 · Data Skeptic Listen

podcast_episode

by Kyle Polich

Survival analysis techniques are useful for studying the longevity of groups of elements or individuals, taking into account time considerations and right censoring. This episode explores how survival analysis can describe marriages, in particular using the semi-parametric Cox proportional hazards model. This episode discusses some good summaries of survey data on marriage and divorce which can be found here. The Python lifelines library is a good place to get started for people who want to do some hands-on work.
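To make right censoring concrete, here is a minimal sketch of a survival curve computed in plain Python. It uses the Kaplan-Meier estimator, a simpler non-parametric technique than the Cox model the episode discusses, and it is not taken from the episode. The function name `kaplan_meier` and the marriage durations are hypothetical illustration data.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with at least one event.

    durations: time until the event, or until the subject was last seen
    observed:  True if the event was seen, False if right-censored
    """
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)   # everyone leaving the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Fraction of those still at risk who survive past time t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Events and censored subjects both leave the risk set after time t
        at_risk -= exits[t]
    return curve

# Hypothetical marriage durations in years; False = still married at last survey
durations = [2, 5, 5, 8, 10, 12]
observed = [True, True, False, True, False, False]
curve = kaplan_meier(durations, observed)
```

Censored couples still count in the denominator up to the time they were last observed, which is how the estimator uses partial information instead of discarding it. For real work, the lifelines library mentioned above provides `KaplanMeierFitter` and `CoxPHFitter` classes.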

Cassandra: The Definitive Guide, 2nd Edition

2016-07-12 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Eben Hewitt , Jeff Carpenter

Cassandra Cloud Computing Data Modelling Docker ELK Hadoop Java JavaScript Spark data data-engineering nosql-databases

Imagine what you could do if scalability wasn't a problem. With this hands-on guide, you’ll learn how the Cassandra database management system handles hundreds of terabytes of data while remaining highly available across multiple data centers. This expanded second edition, updated for Cassandra 3.0, provides the technical details and practical examples you need to put this database to work in a production environment. Authors Jeff Carpenter and Eben Hewitt demonstrate the advantages of Cassandra’s non-relational design, with special attention to data modeling. If you’re a developer, DBA, or application architect looking to solve a database scaling issue or future-proof your application, this guide helps you harness Cassandra’s speed and flexibility. Understand Cassandra’s distributed and decentralized structure. Use the Cassandra Query Language (CQL) and cqlsh, the CQL shell. Create a working data model and compare it with an equivalent relational model. Develop sample applications using client drivers for languages including Java, Python, and Node.js. Explore cluster topology and learn how nodes exchange data. Maintain a high level of performance in your cluster. Deploy Cassandra on site, in the Cloud, or with Docker. Integrate Cassandra with Spark, Hadoop, Elasticsearch, Solr, and Lucene.

Data Visualization with Python and JavaScript

2016-07-12 · O'Reilly Data Visualization Books O'Reilly Amazon

book

by Kyran Dale

API DataViz JavaScript Matplotlib NumPy Pandas data data-science data-science-tasks data-visualization

Learn how to turn raw data into rich, interactive web visualizations with the powerful combination of Python and JavaScript. With this hands-on guide, author Kyran Dale teaches you how to build a basic dataviz toolchain with best-of-breed Python and JavaScript libraries (including Scrapy, Matplotlib, Pandas, Flask, and D3) for crafting engaging, browser-based visualizations. As a working example, throughout the book Dale walks you through transforming Wikipedia’s table-based list of Nobel Prize winners into an interactive visualization. You’ll examine steps along the entire toolchain, from scraping, cleaning, exploring, and delivering data to building the visualization with JavaScript’s D3 library. If you’re ready to create your own web-based data visualizations, and know either Python or JavaScript, this is the book for you. Learn how to manipulate data with Python. Understand the commonalities between Python and JavaScript. Extract information from websites by using Python’s web-scraping tools, BeautifulSoup and Scrapy. Clean and explore data with Python’s Pandas, Matplotlib, and NumPy libraries. Serve data and create RESTful web APIs with Python’s Flask framework. Create engaging, interactive web visualizations with JavaScript’s D3 library.

Mastering Python Data Analysis

2016-06-27 · O'Reilly Data Science Books O'Reilly Amazon

book

by Luiz Felipe Martins , Magnus Vilhelm Persson

AI/ML Pandas data data-science

Mastering Python Data Analysis provides a comprehensive roadmap for Python developers to enhance their data analysis skills to tackle real-world problems. This book delves into advanced statistical analysis, covering tools, models, and methods to transform raw data into valuable insights. What this Book will help me do: Effectively handle and preprocess data using Python and Pandas. Explore statistical models to identify patterns and gain insights from data. Learn clustering approaches to detect data groupings and predict outcomes. Utilize Bayesian methods for quantifying causal relationships. Generate professional reports and visualizations with Python tools like Jupyter Notebook. Author(s): Magnus Vilhelm Persson is a seasoned software developer and data analyst with expertise in leveraging Python for sophisticated data analysis and machine learning tasks. Drawing from years of experience in the tech industry, Persson provides practical, real-world insights throughout the book. His approachable writing style ensures technical concepts are conveyed with clarity, making data analysis accessible to developers at varying skill levels. Who is it for? This book is ideal for intermediate Python developers seeking to elevate their data analysis skills. If you are familiar with Python libraries and have an interest in solving complex data problems, this guide will serve as a stepping stone to mastery. Advanced beginners with a curiosity for statistical methods and a desire to learn through practical examples will find this book invaluable. It is also perfect for professionals aiming to integrate Python-based statistical techniques into their workflow.

talk-data.com

