talk-data.com

Topic

API

Application Programming Interface (API)

integration software_development data_exchange


Activity Trend

Peak of 65 activities per quarter, 2020-Q1 through 2026-Q1

Activities

856 activities · Newest first

Summary There are a number of platforms available for object storage, including self-managed open source projects. But what goes on behind the scenes of the companies that run these systems at scale so you don’t have to? In this episode Will Smith shares the journey that he and his team at Linode recently completed to bring a fast and reliable S3 compatible object storage to production for your benefit. He discusses the challenges of running object storage for public usage, some of the interesting ways that it was stress tested internally, and the lessons that he learned along the way.
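The S3-compatible interface described in this summary means a standard S3 client can talk to the service simply by pointing it at a different endpoint. As a minimal sketch (the endpoint URL, bucket name, and credentials below are placeholders, not details from the episode), the widely used boto3 library might be used like this:

```python
# Minimal sketch: talking to an S3-compatible object store (such as Linode
# Object Storage) with boto3. Endpoint, bucket name, and credentials are
# placeholders for illustration only.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://us-east-1.linodeobjects.com",  # assumed endpoint format
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

s3.create_bucket(Bucket="example-bucket")
s3.put_object(Bucket="example-bucket", Key="hello.txt", Body=b"hello object storage")

# List the objects we just wrote.
response = s3.list_objects_v2(Bucket="example-bucket")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```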

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Will Smith about his work on building object storage for the Linode cloud platform

Interview

Introduction How did you get involved in the area of data management? Can you start by giving an overview of the current state of your object storage product?

What was the motivating factor for building and managing your own object storage system rather than building an integration with another offering such as Wasabi or Backblaze?

What is the scale and scope of usage that you had to design for? Can you describe how your platform is implemented?

What was your criteria for deciding whether to use an available platform such as Ceph or MinIO vs building your own from scratch? How have your initial assumptions about the operability and maintainability of your installation been challenged or updated since it has been released to the public?

What have been the biggest challenges that you have faced in designing and deploying a system that can meet the scale and reliability requirements of Linode? What are the most important capabilities for the underlying hardware that you are running on? What supporting systems and tools are you using to manage the availability and durability of your object storage? How did you approach the rollout of Linode’s object storage to gain the confidence that you needed to feel comfortable with full scale usage? What are some of the benefits that you have gained internally at Linode from having an object storage system available to your product teams? What are your thoughts on the state of the S3 API as a de facto standard for object storage? What is your main focus now that object storage is being rolled out to more data centers?

Contact Info

Dorthu on GitHub dorthu22 on Twitter LinkedIn Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Linode Object Storage Xen Hypervisor KVM (Linux Kernel Virtual Machine)

SQL Server 2019 Administration Inside Out

Conquer SQL Server 2019 administration–from the inside out. Dive into SQL Server 2019 administration–and really put your SQL Server DBA expertise to work. This supremely organized reference packs hundreds of timesaving solutions, tips, and workarounds–all you need to plan, implement, manage, and secure SQL Server 2019 in any production environment: on-premises, cloud, or hybrid. Six experts thoroughly tour DBA capabilities available in SQL Server 2019 Database Engine, SQL Server Data Tools, SQL Server Management Studio, PowerShell, and Azure Portal. You’ll find extensive new coverage of Azure SQL, big data clusters, PolyBase, data protection, automation, and more. Discover how experts tackle today’s essential tasks–and challenge yourself to new levels of mastery.
Explore SQL Server 2019’s toolset, including the improved SQL Server Management Studio, Azure Data Studio, and Configuration Manager
Design, implement, manage, and govern on-premises, hybrid, or Azure database infrastructures
Install and configure SQL Server on Windows and Linux
Master modern maintenance and monitoring with extended events, Resource Governor, and the SQL Assessment API
Automate tasks with maintenance plans, PowerShell, Policy-Based Management, and more
Plan and manage data recovery, including hybrid backup/restore, Azure SQL Database recovery, and geo-replication
Use availability groups for high availability and disaster recovery
Protect data with Transparent Data Encryption, Always Encrypted, new Certificate Management capabilities, and other advances
Optimize databases with SQL Server 2019’s advanced performance and indexing features
Provision and operate Azure SQL Database and its managed instances
Move SQL Server workloads to Azure: planning, testing, migration, and post-migration

Hands On With Google Data Studio

Learn how to easily transform your data into engaging, interactive visual reports! Data is no longer the sole domain of tech professionals and scientists. Whether in our personal, business, or community lives, data is rapidly increasing in both importance and sheer volume. The ability to visualize all kinds of data is now within reach for anyone with a computer and an internet connection. Google Data Studio, quickly becoming the most popular free tool in data visualization, offers users a flexible, powerful way to transform private and public data into interactive knowledge that can be easily shared and understood. Hands On With Google Data Studio teaches you how to visualize your data today and produce professional quality results quickly and easily. No previous experience is required to get started right away—all you need is this guide, a Gmail account, and a little curiosity to access and visualize data just like large businesses and organizations. Clear, step-by-step instructions help you identify business trends, turn budget data into a report, assess how your websites or business listings are performing, analyze public data, and much more. Practical examples and expert tips are found throughout the text to help you fully understand and apply your new knowledge to a wide array of real-world scenarios. This engaging, reader-friendly guide will enable you to:
Use Google Data Studio to access various types of data, from your own personal data to public sources
Build your first data set, navigate the Data Studio interface, customize reports, and share your work
Learn the fundamentals of data visualization, personal data accessibility, and open data APIs
Harness the power of publicly accessible data services including Google’s recently released Data Set Search
Add banners, logos, custom graphics, and color palettes
Hands On With Google Data Studio: A Data Citizens Survival Guide is a must-have resource for anyone starting their data visualization journey, from individuals, consultants, and small business owners to large business and organization managers and leaders.

Summary The modern era of software development is identified by ubiquitous access to elastic infrastructure for computation and easy automation of deployment. This has led to a class of applications that can quickly scale to serve users worldwide. This requires a new class of data storage which can accommodate that demand without having to rearchitect your system at each level of growth. YugabyteDB is an open source database designed to support planet scale workloads with high data density and full ACID compliance. In this episode Karthik Ranganathan explains how Yugabyte is architected, their motivations for being fully open source, and how they simplify the process of scaling your application from greenfield to global.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Karthik Ranganathan about YugabyteDB, the open source, high-performance distributed SQL database for global, internet-scale apps.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what YugabyteDB is and its origin story?
A growing trend in database engines (e.g. FaunaDB, CockroachDB) has been an out of the box focus on global distribution. Why is that important and how does it work in Yugabyte? What are the caveats?
What are the most notable features of YugabyteDB that would lead someone to choose it over any of the myriad other options? What are the use cases that it is uniquely suited to?
What are some of the systems or architecture patterns that can be replaced with Yugabyte?
How does the design of Yugabyte or the different ways it is being used influence the way that users should think about modeling their data?
Yugabyte is an impressive piece of engineering. Can you talk through the major design elements and how it is implemented?
Easy scaling and failover is a feature that many database engines would like to be able to claim. What are the difficult elements that prevent them from implementing that capability as a standard practice? What do you have to sacrifice in order to support the level of scale and fault tolerance that you provide?
Speaking of scaling, there are many ways to define that term, from vertical scaling of storage or compute, to horizontal scaling of compute, to scaling of reads and writes. What are the primary scaling factors that you focus on in Yugabyte?
How do you approach testing and validation of the code given the complexity of the system that you are building?
In terms of the query API you have support for a Postgres compatible SQL dialect as well as a Cassandra based syntax. What are the benefits of targeting compatibility with those platforms? What are the challenges and benefits of maintaining compatibility with those other platforms?
Can you describe how the storage layer is implemented and the division between the different query formats?
What are the operational characteristics of YugabyteDB? What are the complexities or edge cases that users should be aware of when planning a deployment?
One of the challenges of working with large volumes of data is creating and maintaining backups. How does Yugabyte handle that problem?
Most open source infrastructure projects that are backed by a business withhold various "enterprise" features such as backups and change data capture as a means of driving revenue. Can you talk through your motivation for releasing those capabilities as open source?
What is the business model that you are using for YugabyteDB and how does it differ from the tribal knowledge of how open source companies generally work?
What are some of the most interesting, innovative, or unexpected ways that you have seen Yugabyte used?
When is Yugabyte the wrong choice?
What do you have planned for the future of the technical and business aspects of Yugabyte?

Contact Info

@karthikr on Twitter LinkedIn rkarthik007 on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

YugabyteDB GitHub Nutanix Facebook Engineering Apache Cassandra Apache HBase Delphi FaunaDB Podcast Episode CockroachDB Podcast Episode HA == High Availability Oracle Microsoft SQL Server PostgreSQL Podcast Episode MongoDB Amazon Aurora PGCrypto PostGIS pl/pgsql Foreign Data Wrappers PipelineDB Podcast Episode Citus Podcast Episode Jepsen Testing Yugabyte Jepsen Test Results OLTP == Online Transaction Processing OLAP == Online Analytical Processing DocDB Google Spanner Google BigTable Spot Instances Kubernetes Cloudformation Terraform Prometheus Debezium Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
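The interview above notes that YugabyteDB exposes a Postgres-compatible SQL dialect (YSQL) alongside a Cassandra-based syntax. As a minimal sketch of what that compatibility buys you, an ordinary PostgreSQL driver can connect and run familiar SQL; the host, port (5433 is the commonly documented YSQL default), database name, and credentials below are assumptions for illustration, not details from the episode.

```python
# Minimal sketch: because YugabyteDB's YSQL layer speaks the PostgreSQL wire
# protocol, a standard Postgres driver can talk to it. Connection details are
# placeholders for a local test cluster.
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1", port=5433, dbname="yugabyte", user="yugabyte", password="yugabyte"
)
with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS users (id SERIAL PRIMARY KEY, name TEXT NOT NULL)"
    )
    cur.execute("INSERT INTO users (name) VALUES (%s)", ("Ada",))
    cur.execute("SELECT id, name FROM users")
    for row in cur.fetchall():
        print(row)
conn.close()
```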

Summary DataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to support their customers, they need to capture, process, and analyze massive amounts of timeseries data with a high degree of uptime and reliability. Vadim Semenov works on their data engineering team and joins the podcast in this episode to discuss the challenges that he works through, the systems that DataDog has built to power their business, and how their teams are organized to allow for rapid growth and massive scale. Getting an inside look at the companies behind the services we use is always useful, and this conversation was no exception.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Vadim Semenov about how data engineers work at DataDog

Interview

Introduction How did you get involved in the area of data management? For anyone who isn’t familiar with DataDog, can you start by describing the types and volumes of data that you’re dealing with? What are the main components of your platform for managing that information? How are the data teams at DataDog organized and what are your primary responsibilities in the organization? What are some of the complexities and challenges that you face in your work as a result of the volume of data that you are processing?

What are some of the strategies which have proven to be most useful in overcoming those challenges?

Who are the main consumers of your work and how do you build in feedback cycles to ensure that their needs are being met? Given that the majority of the data being ingested by DataDog is timeseries, what are your lifecycle and retention policies for that information? Most of the data that you are working with is customer generated from your deployed agents and API integrations. How do you manage cleanliness and schema enforcement for the events as they are being delivered? What are some of the upcoming projects that you have planned for the upcoming months and years? What are some of the technologies, patterns, or practices that you are hoping to adopt?

Contact Info

LinkedIn @databuryat on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

DataDog Hadoop Hive Yarn Chef SRE == Site Reliability Engineer Application Performance Management (APM) Apache Kafka RocksDB Cassandra Apache Parquet data serialization format SLA == Service Level Agreement WatchDog Apache Spark

Podcast Episode

Apache Pig Databricks JVM == Java Virtual Machine Kubernetes SSIS (SQL Server Integration Services) Pentaho JasperSoft Apache Airflow

Podcast.init Episode

Apache NiFi

Podcast Episode

Luigi Dagster

Podcast Episode

Prefect

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Mining Social Media

Did fake Twitter accounts help sway a presidential election? What can Facebook and Reddit archives tell us about human behavior? In Mining Social Media, senior BuzzFeed reporter Lam Thuy Vo shows you how to use Python and key data analysis tools to find the stories buried in social media. Whether you’re a professional journalist, an academic researcher, or a citizen investigator, you’ll learn how to use technical tools to collect and analyze data from social media sources to build compelling, data-driven stories. Learn how to:
• Write Python scripts and use APIs to gather data from the social web
• Download data archives and dig through them for insights
• Inspect HTML downloaded from websites for useful content
• Format, aggregate, sort, and filter your collected data using Google Sheets
• Create data visualizations to illustrate your discoveries
• Perform advanced data analysis using Python, Jupyter Notebooks, and the pandas library
• Apply what you’ve learned to research topics on your own
Social media is filled with thousands of hidden stories just waiting to be told. Learn to use the data-sleuthing tools that professionals use to write your own data-driven stories.
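The workflow the book describes, downloading an archive and then analyzing it with pandas, can be sketched in a few lines. The file name and the created_at/text field names below are hypothetical; real exports from Twitter, Facebook, or Reddit each use their own schema.

```python
# Minimal sketch: aggregating a downloaded social media archive with pandas.
# File name and field names are hypothetical placeholders.
import pandas as pd

posts = pd.read_json("posts_export.json")            # one record per post
posts["created_at"] = pd.to_datetime(posts["created_at"])

# Posts per day, to spot spikes in activity.
per_day = posts.set_index("created_at").resample("D").size()
print(per_day.sort_values(ascending=False).head(10))

# Crude keyword filter over the post text.
mentions = posts[posts["text"].str.contains("election", case=False, na=False)]
print(len(mentions), "posts mention the keyword")
```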

In this episode, Wayne Eckerson and Matthew Schwartz discuss non-traditional uses of business intelligence tools. Although BI tools have been around for almost three decades, most companies just scratch the surface of what’s possible to do with those tools. Using web layers and APIs, a company can use their imagination to customize and leverage their existing BI toolset to monetize data, integrate tribal knowledge and build industry-specific proprietary products.

Matthew Schwartz is the chief technology officer of Sage Hospitality, one of the world's largest hotel operators. Although Matt is responsible for all aspects of Sage’s IT operations, he has a deep fondness for data and analytics, having served as a BI director for several companies, including PetSmart and Staples. Matt firmly believes in the power of BI tools to transform organizations.

Pro D3.js: Use D3.js to Create Maintainable, Modular, and Testable Charts

Go beyond the basics of D3.js to create maintainable, modular, and testable charts and to package them into a library that can be distributed as open source software or kept for private use. This book will show you how to transform regular D3.js chart code into reusable and extendable modules. You know the basics of working with D3.js, but it's time to become a professional D3.js practitioner. This book is your launching pad to refactoring code, composing complex visualizations from small components, working as a team with other developers, and integrating charts with a Continuous Integration system. You'll begin by creating a production-ready chart using D3.js v5, ES2015, and a test-driven approach and then move on to using and extending Britecharts, the reusable charting library based on Reusable API patterns. Finally, you'll see how to use D3.js along with React to document and build your charts to compose a charting library you can release into the NPM repository. With Pro D3.js, you'll become an accomplished D3.js developer in no time.
What You Will Learn
Create v5 D3.js charts with ES2016 and unit tests
Develop modular, testable and extensible code with the Reusable API pattern
Work with and extend Britecharts, a reusable charting library created at Eventbrite
Use Webpack and npm to create and publish a charting library from your own chart collections
Write reference documentation and build a documentation homepage for your library.
Who This Book Is For
Data scientists, data visualization engineers, and frontend developers with a fundamental knowledge of D3.js and some experience with JavaScript, as well as data journalists and consultants.

Summary Object storage is quickly becoming the unifying layer for data intensive applications and analytics. Modern, cloud oriented data warehouses and data lakes both rely on the durability and ease of use that it provides. S3 from Amazon has quickly become the de-facto API for interacting with this service, so the team at MinIO have built a production grade, easy to manage storage engine that replicates that interface. In this episode Anand Babu Periasamy shares the origin story for the MinIO platform, the myriad use cases that it supports, and the challenges that they have faced in replicating the functionality of S3. He also explains the technical implementation, innovative design, and broad vision for the project.
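Because MinIO replicates the S3 API, it can be used either through generic S3 clients or through the project's own Python SDK. Below is a minimal sketch with the minio package; the endpoint and credentials are placeholders for a local test server, not values from the episode.

```python
# Minimal sketch: using the MinIO Python SDK against a MinIO server that
# exposes the S3 API. Endpoint and credentials are local-test placeholders.
from io import BytesIO
from minio import Minio

client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,  # plain HTTP for a local test deployment
)

if not client.bucket_exists("example-bucket"):
    client.make_bucket("example-bucket")

data = b"hello from MinIO"
client.put_object("example-bucket", "hello.txt", BytesIO(data), length=len(data))

for obj in client.list_objects("example-bucket"):
    print(obj.object_name, obj.size)
```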

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Anand Babu Periasamy about MinIO, the neutral, open source, enterprise grade object storage system.

Interview

Introduction How did you get involved in the area of data management? Can you explain what MinIO is and its origin story? What are some of the main use cases that MinIO enables? How does MinIO compare to other object storage options and what benefits does it provide over other open source platforms?

Your marketing focuses on the utility of MinIO for ML and AI workloads. What benefits does object storage provide as compared to distributed file systems? (e.g. HDFS, GlusterFS, Ceph)

What are some of the challenges that you face in terms of maintaining compatibility with the S3 interface?

What are the constraints and opportunities that are provided by adhering to that API?

Can you describe how MinIO is implemented and the overall system design?

How has that design evolved since you first began working on it?

What assumptions did you have at the outset and how have they been challenged or updated?

What are the axes for scaling that MinIO provides and how does it handle clustering?

Where does it fall on the axes of availability and consistency in the CAP theorem?

One of the useful features that you provide is efficient erasure coding, as well as protection against data corruption. How much overhead do those capabilities incur, in terms of computational efficiency and, in a clustered scenario, storage volume? For someone who is interested in running MinIO, what is involved in deploying and maintain

R Data Science Quick Reference: A Pocket Guide to APIs, Libraries, and Packages

In this handy, practical book you will cover each concept concisely, with many illustrative examples. You'll be introduced to several R data science packages, with examples of how to use each of them. In this book, you’ll learn about the following APIs and packages that deal specifically with data science applications: readr, tibble, forcats, lubridate, stringr, tidyr, magrittr, dplyr, purrr, ggplot2, modelr, and more. After using this handy quick reference guide, you'll have the code, APIs, and insights to write data science-based applications in the R programming language. You'll also be able to carry out data analysis.
What You Will Learn
Import data with readr
Work with categories using forcats, time and dates with lubridate, and strings with stringr
Format data using tidyr and then transform that data using magrittr and dplyr
Write functions with R for data science, data mining, and analytics-based applications
Visualize data with ggplot2 and fit data to models using modelr
Who This Book Is For
Programmers new to R's data science, data mining, and analytics packages. Some prior coding experience with R in general is recommended.

Hands-On Web Scraping with Python

This book, "Hands-On Web Scraping with Python", is your comprehensive guide to mastering web scraping techniques and tools. Harnessing the power of Python libraries like Scrapy, Beautiful Soup, and Selenium, you'll learn how to extract and analyze data from websites effectively and efficiently.
What this Book will help me do
Master the foundational concepts of web scraping using Python.
Efficiently use libraries such as Scrapy, Beautiful Soup, and Selenium for data extraction.
Handle advanced scenarios such as forms, logins, and dynamic content in scraping.
Leverage XPath, CSS selectors, and Regex for precise data targeting and processing.
Improve scraping reliability and manage challenges like cookies, API use, and web security.
Author(s)
Anish Chapagain is an accomplished Python programmer and an expert in web scraping methodologies. With years of experience in applying Python to solve practical data challenges, they bring a clear and insightful approach to teaching these skills. Readers appreciate their practical examples and ready-to-use guidance for real-world applications.
Who is it for?
This book is designed for Python developers and data enthusiasts eager to master web scraping. Whether you're a beginner looking to deep dive into new techniques or an analyst needing reliable data extraction methods, this book offers clear guidance. A basic understanding of Python is recommended to fully benefit from this text.
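As a flavor of the techniques the book covers, here is a minimal scraping sketch using requests and Beautiful Soup with CSS selectors. The URL and selectors are hypothetical; adapt them to the structure of the site you are scraping and respect its robots.txt and terms of use.

```python
# Minimal sketch: fetch a page with requests and extract data with Beautiful
# Soup using CSS selectors. URL and selectors are placeholders.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/books", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for item in soup.select("article.book"):            # one element per listing
    title = item.select_one("h2").get_text(strip=True)
    link = item.select_one("a")["href"]
    print(title, link)
```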

Summary Building and maintaining a data lake is a choose your own adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allows for generating significant value, but it can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms that incorporates the lessons learned at DataBricks from countless customer use cases. In this episode Michael Armbrust, the lead architect of Delta Lake, explains how the project is designed, how you can use it for building a maintainable data lake, and some useful patterns for progressively refining the data in your lake. This conversation was useful for getting a better idea of the challenges that exist in large scale data analytics, and the current state of the tradeoffs between data lakes and data warehouses in the cloud.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Michael Armbrust about Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark and big data workloads.

Interview

Introduction How did you get involved in the area of data m

Summary Building a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are able to get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains why deploying and maintaining machine learning projects in production is different from regular software projects and the challenges that they bring. He also describes the Hydrosphere platform, and how the different components work together to manage the full machine learning lifecycle of model deployment and retraining. This was a useful conversation to get a better understanding of the unique difficulties that exist for machine learning projects.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Stepan Pushkarev about Hydrosphere, the first open source platform for Data Science and Machine Learning Management automation

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Hydrosphere is and share its origin story? In your experience, what are the most challenging or complicated aspects of managing machine learning models in a production context?

How does it differ from deployment and maintenance

Stream Processing with Apache Spark

Before you can build analytics tools to gain quick insights, you first need to know how to process data in real time. With this practical guide, developers familiar with Apache Spark will learn how to put this in-memory framework to use for streaming data. You’ll discover how Spark enables you to write streaming jobs in almost the same way you write batch jobs. Authors Gerard Maas and François Garillot help you explore the theoretical underpinnings of Apache Spark. This comprehensive guide features two sections that compare and contrast the streaming APIs Spark now supports: the original Spark Streaming library and the newer Structured Streaming API.
Learn fundamental stream processing concepts and examine different streaming architectures
Explore Structured Streaming through practical examples; learn different aspects of stream processing in detail
Create and operate streaming jobs and applications with Spark Streaming; integrate Spark Streaming with other Spark APIs
Learn advanced Spark Streaming techniques, including approximation algorithms and machine learning algorithms
Compare Apache Spark to other stream processing projects, including Apache Storm, Apache Flink, and Apache Kafka Streams
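The book's central point, that Structured Streaming lets you write a streaming job almost the way you write a batch job, can be illustrated with a small PySpark sketch. It uses the built-in rate source so it runs without any external system; the window size and run duration are arbitrary choices for the demo.

```python
# Minimal sketch of Structured Streaming in PySpark: a windowed aggregation
# written like an ordinary batch query, fed by the built-in "rate" source.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Source that emits (timestamp, value) rows at a fixed rate.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Count events per 10-second window.
counts = events.groupBy(window(events.timestamp, "10 seconds")).count()

query = (
    counts.writeStream.outputMode("complete")
    .format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination(30)  # run for ~30 seconds for the demo, then exit
spark.stop()
```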

Summary Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components onto a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allows his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service this is definitely worth listening to for some perspective.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other

R Quick Syntax Reference: A Pocket Guide to the Language, APIs and Library

This handy reference book detailing the intricacies of R updates the popular first edition by adding R version 3.4 and 3.5 features. Starting with the basic structure of R, the book takes you on a journey through the terminology used in R and the syntax required to make R work. You will find looking up the correct form for an expression quick and easy. Some of the new material includes information on RStudio, S4 syntax, working with character strings, and an example using the Twitter API. With a copy of the R Quick Syntax Reference in hand, you will find that you are able to use the multitude of functions available in R and are even able to write your own functions to explore and analyze data.
What You Will Learn
Discover the modes and classes of R objects and how to use them
Use both packaged and user-created functions in R
Import/export data and create new data objects in R
Create descriptive functions and manipulate objects in R
Take advantage of flow control and conditional statements
Work with packages such as base, stats, and graphics
Who This Book Is For
Those with programming experience, either new to R, or those with at least some exposure to R but who are new to the latest version.

Highlights
There can only be one! Billie Eilish takes the coveted 100 Spotify Popularity Index from Ariana Grande.

Mission
Good morning, it’s Jason here at Chartmetric with your 3-minute Data Dump where we upload charts, artists and playlists into your brain so you can stay up on the latest in the music data world.

Date
This is your Data Dump for Friday April 12th 2019.

Friday Fun Fact
We’re going to try a new segment today called Friday Fun Fact where we do a little deep dive into a particular piece of data. Thanks to Komala, one of our brilliant engineers, we noticed that on Wednesday this week, American artist Billie Eilish has received the honor of #1 most popular artist on Spotify according to not followers, not monthly listeners, but to a number we call the Spotify Popularity Index, or SPI. What in the world is that? You don’t see it as a normal Spotify user in the app, but it sure sounds important, and it must be used for something...right? According to Spotify’s API documentation, the number is “The popularity of the artist. The value will be between 0 and 100, with 100 being the most popular. The artist’s popularity is calculated from the popularity of all the artist’s tracks.” We’ve also noticed that it’s re-calculated about once a day. Before we go further, let’s take a look at the top 5 according to SPI: #2 and #3 is Ariana Grande and Drake at 99, Khalid at 98 and then Post Malone at 97. Apparently there can only be one artist in the 100 SPI spot, and from Feb 11th to April 9th, the crown was worn by Ariana Grande. Now if the SPI is calculated from the SPI of the individual tracks, this would make sense as Grande released her last full-length album Thank U, Next on that Friday February 8th, which naturally launched her into the #1 popularity spot within a few days. In turn, Billie Eilish’s newer album, WHEN WE FALL ASLEEP, WHERE DO WE GO?, was released on Friday March 29th, and it took her a little under two weeks to take the crown. Why did Eilish take about 12 days and Grande only 3? Likely because of the recent competition. Because before Grande in the #1 spot, it was Toronto’s very own Drake with his massively promoted Scorpion album, who was there for over half a year, from June 2018 until Grande’s February album. While Drake is currently in 882 editorial Spotify playlists, Grande at least had the benefit of time on her side for the Scorpion effect to dull its shine slightly. Eilish’s album however, despite Grande currently being in 828 editorial playlists versus Eilish’s 283, needed a few more days to fight through Grande’s very strong presence on the platform. Remember that it’s the popularity of each track that makes up the artist’s SPI, and so their fans’ connection with their material may also come into play as a factor: according to a Music Business Worldwide article on March 21st, Eilish’s 800K album pre-adds on Apple Music shattered any previous record on the platform. While it’s obviously not Spotify, it does say something about how her fans seem to be dedicated to the album, while Grande’s body of work seems to be driven towards singles. It’s also worth noting that Grande’s least played track of her album, “make up” at 47M spins, beats out more than 60% of Eilish’s album tracks by spins, meaning that it must be more about each track’s virality, for lack of a better term, than aggregate numbers, meaning that time, indeed, must play a major factor.

Outro
That’s it for your Daily Data Dump for Friday April 12th 2019. This is Jason from Chartmetric. Look out for our new article on the success of female artists on the charts from our newest data scientist team member Josh, on our Medium blog at blog.chartmetric.io. Feel free to sign up for a free account at chartmetric.io/signup. And article links and show notes are at: chartmetric.transistor.fm/episodes. Happy Friday, have a lovely weekend.
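The Spotify Popularity Index discussed above is simply the popularity field on the artist object returned by Spotify's Web API. Here is a hedged sketch of reading it with the requests library; the artist ID and bearer token are placeholders, and a real token would come from Spotify's client-credentials OAuth flow.

```python
# Minimal sketch: read an artist's 0-100 popularity value from the Spotify
# Web API. The artist ID and token below are placeholders.
import requests

ARTIST_ID = "SPOTIFY_ARTIST_ID"   # placeholder: a real base62 artist ID
TOKEN = "YOUR_ACCESS_TOKEN"       # placeholder: OAuth token from Spotify

resp = requests.get(
    f"https://api.spotify.com/v1/artists/{ARTIST_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
artist = resp.json()
print(artist["name"], artist["popularity"])  # popularity is the 0-100 value
```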

Stream Processing with Apache Flink

Get started with Apache Flink, the open source framework that powers some of the world’s largest stream processing applications. With this practical book, you’ll explore the fundamental concepts of parallel stream processing and discover how this technology differs from traditional batch data processing. Longtime Apache Flink committers Fabian Hueske and Vasia Kalavri show you how to implement scalable streaming applications with Flink’s DataStream API and continuously run and maintain these applications in operational environments. Stream processing is ideal for many use cases, including low-latency ETL, streaming analytics, and real-time dashboards as well as fraud detection, anomaly detection, and alerting. You can process continuous data of any kind, including user interactions, financial transactions, and IoT data, as soon as you generate them.
Learn concepts and challenges of distributed stateful stream processing
Explore Flink’s system architecture, including its event-time processing mode and fault-tolerance model
Understand the fundamentals and building blocks of the DataStream API, including its time-based and stateful operators
Read data from and write data to external systems with exactly-once consistency
Deploy and configure Flink clusters
Operate continuously running streaming applications
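For a taste of the DataStream API the book teaches, here is a minimal word-count sketch using PyFlink, the Python bindings (the book itself works in Java and Scala, so treating the Python API as equivalent is an assumption). It runs on a small in-memory collection, so no external source is needed.

```python
# Minimal sketch: a keyed word count on the Flink DataStream API via PyFlink.
# Uses a bounded in-memory collection instead of a real streaming source.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

lines = env.from_collection(
    ["stream processing with flink", "stateful stream processing"]
)

counts = (
    lines.flat_map(lambda line: [(word, 1) for word in line.split()])
    .key_by(lambda pair: pair[0])
    .reduce(lambda a, b: (a[0], a[1] + b[1]))
)

counts.print()          # emits running (word, count) pairs to stdout
env.execute("word-count-sketch")
```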

Integration of IBM Aspera Sync with IBM Spectrum Scale: Protecting and Sharing Files Globally

Economic globalization requires data to be available globally. With most data stored in file systems, solutions to make this data globally available become more important. Files that are in file systems can be protected or shared by replicating these files to another file system that is in a remote location. The remote location might be just around the corner or in a different country. Therefore, the techniques that are used to protect and share files must account for long distances and slow and unreliable wide area network (WAN) connections. IBM® Spectrum Scale is a scalable clustered file system that can be used to store all kinds of unstructured data. It provides open data access by way of Network File System (NFS); Server Message Block (SMB); POSIX; Object Storage APIs, such as S3 and OpenStack Swift; and the Hadoop Distributed File System (HDFS) for accessing and sharing data. The IBM Aspera® file transfer solution (IBM Aspera Sync) provides predictable and reliable data transfer across large distances for small and large files. The combination of both can be used for global sharing and protection of data. This IBM Redpaper™ publication describes how IBM Aspera Sync can be used to protect and share data that is stored in IBM Spectrum™ Scale file systems across large distances of several hundred to thousands of miles. We also explain the integration of IBM Aspera Sync with IBM Spectrum Scale™ and differentiate it from solutions that are built into IBM Spectrum Scale for protection and sharing. We also describe different use cases for IBM Aspera Sync with IBM Spectrum Scale.

This week, host Al Martin goes deep with Madhu Kochar and Hemanth Manda, two leaders of product development from the IBM Data and AI team. They discuss the future foundations of digital business -- in particular, the coming age of multicloud and how organizations will contend with data and workloads on cloud systems that span geographies, vendors, and diverse rules for governance. The conversation turns to the need for a data platform that can foster AI initiatives across these diverse environments.


Shownotes:

00:00 - Check us out on YouTube and SoundCloud.  00:10 - Connect with Producer Steve Moore on LinkedIn & Twitter.  00:15 - Connect with Producer Liam Seston on LinkedIn & Twitter.  00:20 - Connect with Producer Rachit Sharma on LinkedIn.  00:25 - Connect with Host Al Martin on LinkedIn & Twitter.  00:40 - Connect with Hemanth Manda on LinkedIn. 00:45 - Connect with Madhu Kochar on LinkedIn.

05:48 – What is Multicloud? 11:38 – Dig into ICP for Data. 18:30 - Learn more about Open API. 32:55 - Check out Stratechery By Ben Thompson.

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.