Highlights
Country rules the US iTunes charts yesterday with 38 track tags
Diana Ross celebrates her 75th birthday with over 102M YouTube views
South Asia gets down with the leading Desi playlist, “Desi Hits”, at 241K followers

Mission
Good morning, it’s Jason here at Chartmetric with your 3-minute Data Dump, where we upload charts, artists and playlists into your brain so you can stay up on the latest in the music data world.

Date
This is your Data Dump for Tuesday, March 26th, 2019.

Charts
Country fans love their downloads, or at least they did yesterday. On Monday, the United States storefront of the iTunes download marketplace had 38 tracks in its Top 200 list carrying the genre tag “country”. These genre tags are non-exclusive, meaning that Apple can internally tag a track with multiple genres, but it still makes quite the impression when “pop” came in with 36 and “hip-hop/rap” with 32.

The 38 country tracks are well distributed across the 200, with the romantic ballad “Beautiful Crazy” by Luke Combs at the #13 spot. What’s interesting is that despite the genre’s influence on the iTunes chart, there are no country artists among the top 10 most frequently occurring artists. For example, Mötley Crüe had the most tracks on yesterday’s iTunes chart with 11, thanks to their Netflix biopic “The Dirt”, which released on Friday, sparking nostalgia for some and curiosity for others. Queen, meanwhile, continues to benefit from their five-month-old “Bohemian Rhapsody” biopic, with 10 tracks still on the iTunes chart.

But fear not: the #1 iTunes track yesterday actually was country. “Shallow” by Lady Gaga and Bradley Cooper, from yet another music film, A Star Is Born, just got coded by Apple as “soundtrack”, which is just another quirk in the music data world.

Artist Highlight in the News
Happy birthday to American legend Diana Ross, who turns 75 years young today! Yesterday, Rolling Stone published news of her live concert film “Diana Ross: Her Life, Love and Legacy” releasing for only two days, today and Thursday, showcasing her epic 1983 NYC Central Park show.

Ross has healthy numbers in the streaming world, with a 68 Spotify Popularity Index rating, 792K followers and a steadily growing monthly listener count at 4.2M, up from 2.9M only a year ago. The ex-Supremes member is still rocking her Instagram account at 319K followers, posting lots of throwback footage and glamorous shots of her own career, mostly to fans her junior: over 126K IG followers are 18-24 and 97K are 25-34. Though Ms. Ross sports only 6K subscribers on YouTube, her long career of epic music has earned over 102M YouTube views, which, given her low subscriber count, shows how much active search and recommended-video algorithms keep her music alive.

Happy birthday, Miss Ross.

Playlist Round-Up
With Spotify’s launch in India about one month ago, attention is likely turning to the playlist “Desi Hits”, Spotify’s leading South Asian playlist at 241K followers. Growing its followers 6% over the past month, and changing about a quarter of its tracks every month, “Desi Hits” is the closest thing the platform has to the region’s version of “Today’s Top Hits”, which sits at 22M followers.

While “Desi Hits” obviously has some ground to make up, keep in mind that India alone is home to over 374M smartphone users and 1.3B people overall. For comparison, the US has 251M smartphone users out of 326M people. Put another way, that’s a smartphone penetration rate of only about 27% in India compared to the US’s 77%, leaving lots of room for growth.

The playlist “Desi Hits” is dominated by the giant music and film company T-Series, with 35 tracks in the 86-track list coming under its banner. You may have heard of the company through its ongoing YouTube subscriber battle with YouTuber PewDiePie, as T-Series is now within only 4K subscribers at 91,317,xxx subs. “Desi Hits” has virtually all of its tracks in the upper half of the Echo Nest Energy chart, promising lots of danceable beats from a set of artists who are 52% from India but 20% from the UK, thanks to the Indian diaspora. Whether you’re looking for bhangra beats or Bollywood mega-hits, “Desi Hits” likely has a lot of followers coming its way.

Outro
That’s it for your Daily Data Dump for Tuesday, March 26th, 2019. This is Jason from Chartmetric; a free account is waiting for you at chartmetric.io/signup. That’s chartmetric (no s) dot IO slash signup. Happy Tuesday, see you tomorrow!
Summary
Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grows. Enterprise organizations feel this acutely due to the silos that occur naturally across business units. The CluedIn team experienced this issue first-hand in their previous roles, leading them to found a business aimed at building a managed data fabric for the enterprise. In this episode Tim Ward, CEO of CluedIn, joins me to explain how their platform is architected, how they manage the task of integrating with third-party platforms, automating entity extraction and master data management, and the work of providing multiple views of the same data for different use cases. I highly recommend listening closely to his explanation of how they manage consistency of the data that they process across different storage backends.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need, then it’s time to talk to our friends at strongDM. They have built an easy-to-use platform that lets you leverage your company’s single sign-on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey, and today I’m interviewing Tim Ward about CluedIn, an integration platform for implementing your company’s data fabric.
Interview
Introduction
How did you get involved in the area of data management?
Highlights
Greyson Chance places 10 tracks on the QQ Music Western charts last week
Billie Eilish gets over 800K “pre-adds” on Apple Music, likely with American females under 25 according to Instagram
Deezer’s #1 playlist “Les titres du moment” continues to dominate the platform’s playlist ecosystem

Mission
Good morning, it’s Jason here at Chartmetric with your 3-minute Data Dump, where we upload charts, artists and playlists into your brain so you can stay up on the latest in the music data world.

Date
This is your Data Dump for Monday, March 25th, 2019.

Charts
Texas-born, Oklahoma-bred Greyson Chance takes up 10% of the QQ Music Western charts for the week of March 14th to the 20th. QQ Music, according to TechCrunch, is one of the largest music streaming platforms in China, with a reported 800M+ monthly active user base when grouped under the Tencent Music Entertainment umbrella. The Chinese tech conglomerate leverages QQ with its social network WeChat (which has over 1B users, that’s billion with a “b”) to generate the majority of its revenue from social media advertising, live-streaming virtual gifts and premium memberships, along with traditional song sales and subscription-based revenue.

QQ updates a 100-song Western music chart every week with its listeners’ most-streamed tunes, and Greyson Chance got 10 tracks from his newly released 12-track album “portraits” on the list. The AWAL-backed crooner’s top 4 Spotify cities by monthly listeners are all in Asia, specifically Singapore, Quezon City in the Philippines, Jakarta and Kuala Lumpur. He’s not the only Westerner with an Asia-conscious strategy: the #1 spot on the same week’s QQ chart went to Bulgarian-Russian singer Kristian Kostov, who is currently competing in the Chinese TV singing competition series 我是歌手, or translated, “I Am A Singer”.

Artist Highlight in the News
Music Business Worldwide reported on Thursday last week that LA-based Billie Eilish racked up 800K-plus album pre-adds on the Apple Music platform, showing strong engagement from an audience looking forward to the March 29th release date. Looking at the phenom’s Instagram follower demographics as of last month, the majority of those pre-adds are likely coming from American females under 25.

Her IG handle, @wherearetheavocados, has likely racked up over 15M followers by the time you hear this, and has earned an average of 2.4M likes and 33K comments per post. A little over 73% of those followers are female, with 64% of them under the age of 25. 3.1M of the female followers fall into the 13-17-year-old age range, while 5.3M of them are in the 18-24 range. Just over a third are based in the US, with New York City and LA being the two most popular cities, though São Paulo pulls into the #3 spot. Some of her notable IG followers are Kendall Jenner at 107M followers and Katy Perry at 77M, and maybe they’re among Eilish’s Apple album pre-adds as well!

Playlist Round-Up
“Les titres du moment” is the most followed playlist on Deezer, with 9.8M fans following the platform’s version of Spotify’s Today’s Top Hits or Apple Music’s Today’s Hits. The #2 Deezer playlist is “Selección Editorial”, focusing on Latin content, but it’s far behind at 6.6M fans, leaving the top playlist lots of room to enjoy its leading status.

“Les titres du moment”, or literally “Tracks of the Moment”, currently sports 70 tracks that are mostly frontline-focused, featuring a 37% track-add ratio in the past month. Unlike top playlists on other platforms, this playlist regularly adds and removes tracks throughout the week, mostly from Tuesday to Saturday. Almost ¾ of its historical tracks have remained on the list for one to six months, showing that it’s willing to host its more successful records for longer periods.

Currently, the #1 spot goes to electronic-tinged Belgian singer-songwriter Angèle, the #2 spot to the UK’s Sam Smith and American Normani, and the #13 spot to Puerto Rico’s Daddy Yankee... so if you’re looking for an internationally focused, multilingual top-hits playlist to mix things up, look no further.

Outro
That’s it for your Daily Data Dump for Monday, March 25th, 2019. This is Jason from Chartmetric; a free account is waiting for you at chartmetric.io/signup. That’s chartmetric (no s) dot IO slash signup. Happy Monday, see you tomorrow!
Highlights
Maluma and Metallica lead in Shazam chart presence in Mexico City yesterday
Jay-Z’s album “The Blueprint” is missing from Spotify and Apple Music, but gets added to the US National Recording Registry
Apple Music’s “This Week on the Voice” continues to refresh its catalog-focused playlist

Mission
Good morning, it’s Jason here at Chartmetric with your 3-minute Data Dump, where we upload charts, artists and playlists into your brain so you can stay up on the latest in the music data world.

Date
This is your Data Dump for Friday, March 22nd, 2019.

Charts
Colombian reggaeton artist Maluma and iconic American metal band Metallica find themselves in the same sentence for probably the first time in history, as both had two tracks in the Shazam Top Tracks chart for Mexico City yesterday. Shazam updates a 50-track chart daily for each city the music recognition app operates in, and the two artists led in having the most tracks on the short list.

Maluma’s record “HP”, which released last month on Feb 28th, is accompanied by his collaboration with fellow Colombian Karol G on the romantic song “Créeme”, which released back in Nov 2018. However, that’s not quite as throwback as Metallica’s two tracks “For Whom the Bell Tolls” and “Orion”, which released in 1984 and 1986 respectively. Seeing such different tracks being Shazamed so much in various bars and clubs across Mexico City may come as a surprise, but then again, maybe it’s to be expected from a cosmopolitan city that Spotify called the “World’s Music Streaming Mecca” last November.

Artist Highlight in the News
On Wednesday, Brooklyn’s legendary rapper Jay-Z put another notch in his Yankee cap when his 2001 album “The Blueprint” got added to the US Library of Congress National Recording Registry, which designates "culturally, historically, or aesthetically important" recordings in America. Despite the accolade, the classic album is missing from two of America’s biggest streaming platforms, Spotify and Apple Music, presumably due to Jay-Z’s involvement with the Tidal streaming service and its focus on exclusivity.

So how does one track the Hov in the music data world? YouTube and terrestrial radio might work. Mr. Carter has 1.9M YouTube subscribers and over 678M total views on the platform, which for the past two years has shown a very consistent “weekend bump” pattern, revealing one way in which his fans still consume his material: in the past month, his work-week views would bottom out at about 1M daily views, while his weekend views would spring up to 1.3M. Terrestrial radio is another place, though Jiggaman might be surprised that his beloved New York City doesn’t play him the most in all of America: in the past six months, Detroit and Milwaukee have given Hov an average of 1,435 radio spins, while NYC comes out to 1,056.

Playlist Round-Up
The American singing competition series “The Voice” is currently in its 16th TV season, and you might not know that the producers also maintain an Apple Music curator account aptly titled “The Voice”. With it, they currently maintain 96 series-related playlists, with “This Week on the Voice” being the longest-running and most consistently maintained one.

In the US Apple Music storefront, the playlist runs parallel with that week’s music performances. The curators usually add tracks over the weekend on Saturday, while the episodes this season have aired on Monday and Tuesday nights in America. A mostly catalog-focused playlist, it is that rare breed that also features a 100% track-add ratio in the past month, due to its unique synchronicity with an ongoing TV series whose songs completely change on a weekly basis. The current track list runs as far back as 1967 with the Bee Gees’ “To Love Somebody”, and as recent as last year with Luke Combs’ “Beautiful Crazy”. A mishmash of karaoke-worthy hits, be sure to add it to your Apple Music app if you want to re-live the episode all the way through to the weekend.

Outro
That’s it for your Daily Data Dump for Friday, March 22nd, 2019. This is Jason from Chartmetric; virtually all this data is available at your fingertips, so feel free to sign up for a free account at chartmetric.io/signup. That’s chartmetric (no s) dot IO slash signup. Bye for now, have a great weekend!
Highlights
Imagine Dragons reigns supreme on the Amazon Music Top Songs chart
Power couple Karol G and Anuel AA make a perfect Instagram pairing
Deezer Music continues to make inroads in Brazil with three playlists rapidly growing their follower counts

Mission
Good morning, it’s Jason here at Chartmetric with your 3-minute Data Dump, where we upload charts, artists and playlists into your brain so you can stay up on the latest in the music data world.

Date
This is your Data Dump for Wednesday, March 20th, 2019.

Charts
American pop-rock band Imagine Dragons is currently dominating the Amazon Music Top Songs chart with six tracks in the 100-track chart. According to MIDiA Research’s mid-2018 market report, Amazon Music is the #3 biggest streaming platform in the world by paying subscribers, at 12% market share and 27.9M subscribers. As of yesterday, March 19th, six tracks from the Las Vegas-born band are staggered across the platform’s most-played songs, from position 10 down to position 82.

The mid-tempo track “Natural”, from Imagine Dragons’ latest album “Origins”, leads the group in the number 10 spot, while the sample-laden record “Thunder”, from their 2017 album “Evolve”, is next in the number 26 spot. In case you were missing their smash hit “Radioactive”, it’s still showing lots of gas left in the tank at the 82nd spot on the Amazon Music chart, while clocking in 849M spins on Spotify.

Artist Highlight in the News
Colombian artist Karol G and Puerto Rican artist Anuel AA are hanging strong in the Top 100 Spotify artists by monthly listeners, with Karol G at number 87 with 20.7M and Anuel AA at number 64 with 23.7M. The Latin power couple publicly built a relationship over most of 2017 and 2018 through an enduring series of Instagram posts together, to their fans’ delight: their combined IG audience adds up to over 28M followers.

From a business perspective, their relationship seemingly combines their audiences brilliantly. Their top 3 Spotify cities by monthly listeners are virtually identical: Santiago, Mexico City, and Buenos Aires. Their top countries by YouTube views are also the same: Mexico, Colombia and Argentina. But according to their Instagram demographics, their gender split is the mirror opposite: ¾ of Karol G’s followers are female and ¼ male, while Anuel AA’s followers are ¾ male and ¼ female. All that data adds up to a great way to pack their next stadium.

Playlist Round-Up
Deezer Music, the #5 most-used streaming service worldwide by subscriber count, wedged between Tencent Music at #4 and Google at #6, is getting the Brazilian traction it wanted with several new playlists. “Vem pro Sertanejo” leads the pack with 48K new followers (4% growth) over the past month, totaling 1.2M, and highlights a popular domestic genre that loosely parallels American country music. “Funkadão”, a high-energy playlist focused on filling dance floors, is the second-fastest-growing playlist, up 39K followers over the last month to a total of 1M. Serving as evidence that Deezer’s domestic-repertoire-first approach is working there, Deezer’s third-fastest-growing playlist is “Sertanejo Apaixonado”, another sertanejo genre list, with 6% follower growth in the past month, totaling 554K followers. According to Music Ally, a whopping ⅔ of the $90M streaming market in Brazil was subscription-based in 2016, so it looks like they’ve been onto something.

Outro
That’s it for your Daily Data Dump for Wednesday, March 20th, 2019. This is Jason from Chartmetric; virtually all this data is available at your fingertips, so feel free to sign up for a free account at chartmetric.io/signup. That’s chartmetric (no s) dot IO slash signup. Bye for now!
Summary
Delivering a data analytics project on time and with accurate information is critical to the success of any business. DataOps is a set of practices to increase the probability of success by creating value early and often, and using feedback loops to keep your project on course. In this episode Chris Bergh, head chef of Data Kitchen, explains how DataOps differs from DevOps, how the industry has begun adopting DataOps, and how to adopt an agile approach to building your data platform.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need, then it’s time to talk to our friends at strongDM. They have built an easy-to-use platform that lets you leverage your company’s single sign-on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
"There aren’t enough data conferences out there that focus on the community, so that’s why these folks built a better one": Data Council is the premier community-powered data platforms & engineering event for software engineers, data engineers, machine learning experts, deep learning researchers & artificial intelligence buffs who want to discover tools & insights to build new products. This year they will host over 50 speakers and 500 attendees (yeah, that’s one of the best "Attendee:Speaker" ratios out there) in San Francisco on April 17-18th, and are offering a $200 discount to listeners of the Data Engineering Podcast. Use code DEP-200 at checkout.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey, and today I’m interviewing Chris Bergh about the current state of DataOps and why it’s more than just DevOps for data.
Interview
Introduction
How did you get involved in the area of data management?
We talked last year about what DataOps is, but can you give a quick overview of how the industry has changed or updated the definition since then?
It is easy to draw parallels between DataOps and DevOps, can you provide some clarity as to how they are different?
How has the conversat
Carry out data analysis with PySpark SQL, graphframes, and graph data processing using a problem-solution approach. This book provides solutions to problems related to dataframes, data manipulation, summarization, and exploratory analysis. You will improve your skills in graph data analysis using graphframes and see how to optimize your PySpark SQL code.
PySpark SQL Recipes starts with recipes on creating dataframes from different types of data sources, data aggregation and summarization, and exploratory data analysis using PySpark SQL. You’ll also discover how to solve problems in graph analysis using graphframes. On completing this book, you’ll have ready-made code for all your PySpark SQL tasks, including creating dataframes using data from different file formats as well as from SQL or NoSQL databases.
What You Will Learn
Understand PySpark SQL and its advanced features
Use SQL and HiveQL with PySpark SQL
Work with structured streaming
Optimize PySpark SQL
Master graphframes and graph processing
Who This Book Is For
Data scientists, Python programmers, and SQL programmers.
Summary
Customer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are doing and how best to serve them you may need to send data to multiple services, each with their own tracking code or APIs. To simplify this process and allow your non-engineering employees to gain access to the information they need to do their jobs Segment provides a single interface for capturing data and routing it to all of the places that you need it. In this interview Segment CTO and co-founder Calvin French-Owen explains how the company got started, how it manages to multiplex data streams from multiple sources to multiple destinations, and how it can simplify your work of gaining visibility into how your customers are engaging with your business.
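The multiplexing idea at the heart of this episode, one capture call fanned out to many destinations, can be sketched in a few lines of Python. This is a toy illustration, not Segment's actual API; the class and sink names are hypothetical.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]


class EventMultiplexer:
    """Fan a single stream of tracking events out to multiple destinations."""

    def __init__(self) -> None:
        self._destinations: List[Callable[[Event], None]] = []

    def register(self, destination: Callable[[Event], None]) -> None:
        """Add a destination callback that will receive every tracked event."""
        self._destinations.append(destination)

    def track(self, event: Event) -> None:
        # One capture call; every registered destination receives its own copy,
        # so one sink mutating an event cannot corrupt another sink's view.
        for destination in self._destinations:
            destination(dict(event))


# Hypothetical destinations standing in for an analytics tool and a warehouse.
analytics_sink: List[Event] = []
warehouse_sink: List[Event] = []

mux = EventMultiplexer()
mux.register(analytics_sink.append)
mux.register(warehouse_sink.append)
mux.track({"user": "u42", "action": "signup"})
```

A production system adds per-destination schema translation, buffering, and retries, but the single-capture, many-destination shape is the same.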
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need, then it’s time to talk to our friends at strongDM. They have built an easy-to-use platform that lets you leverage your company’s single sign-on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on May 17th, you still have time to grab a ticket to the Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Your host is Tobias Macey, and today I’m interviewing Calvin French-Owen about the data platform that Segment has built to handle multiplexing continuous streams of data from multiple sources to multiple destinations.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Segment is and how the business got started?
What are some of the primary ways that your customers are using the Segment platform?
How have the capabilities and use cases of the Segment platform changed since it was first launched?
Layered on top of the data integration platform you have added the concepts of Protocols and Personas. Can you explain how each of those products fit into the over
Summary
Deep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and managing the platforms that power these models. To help us understand what is involved, we are joined this week by Thomas Henson. In this episode he shares his experiences experimenting with deep learning, what data engineers need to know about the infrastructure and data requirements to power the models that your team is building, and how it can be used to supercharge our ETL pipelines.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. 
For even more opportunities to meet, listen, and learn from your peers you don’t want to miss the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th, both run by our friends at O’Reilly Media. Go to dataengineeringpodcast.com/stratacon and dataengineeringpodcast.com/aicon to register today and get 20% off.
Your host is Tobias Macey and today I’m interviewing Thomas Henson about what data engineers need to know about deep learning, including how to use it for their own projects.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what deep learning is for anyone who isn’t familiar with it?
What has been your personal experience with deep learning and what set you down that path?
What is involved in building a data pipeline and production infrastructure for a deep learning product?
How does that differ from other types of analytics projects such as data warehousing or traditional ML?
For anyone who is in the early stages of a deep learning project, what are some of the edge cases or gotchas that they should be aware of?
What are your opinions on the level of involvement/understanding that data engineers should have with the analytical products that are being built with the information we collect and curate?
What are some ways that we can use deep learning as part of the data management process?
How does that shift the infrastructure requirements for our platforms?
Cloud providers have b
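One way the ideas above show up in practice is using a learned model inside the pipeline itself, for example to score incoming records for data-quality problems before they reach downstream consumers. The sketch below is a plain-Python stand-in, not anything from the episode: a fixed-weight logistic scorer takes the place of a real trained network, and the weights, feature names, and threshold are all invented for illustration.

```python
import math

# A toy "model": a logistic scorer with hand-picked weights, standing in
# for a trained network that flags suspicious records during an ETL step.
WEIGHTS = {"null_fields": 1.8, "len_dev": 0.9}
BIAS = -2.0

def anomaly_score(record, expected_len=5):
    # Feature 1: how many fields are missing a value.
    null_fields = sum(1 for v in record.values() if v is None)
    # Feature 2: deviation from the expected number of fields.
    len_dev = abs(len(record) - expected_len)
    z = WEIGHTS["null_fields"] * null_fields + WEIGHTS["len_dev"] * len_dev + BIAS
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> probability-like score

def etl_filter(records, threshold=0.5):
    """Route records: clean ones continue, suspicious ones go to quarantine."""
    clean, quarantine = [], []
    for r in records:
        (quarantine if anomaly_score(r) >= threshold else clean).append(r)
    return clean, quarantine

records = [
    {"id": 1, "name": "a", "ts": 10, "val": 1.0, "src": "x"},      # healthy
    {"id": 2, "name": None, "ts": None, "val": None, "src": "x"},  # mostly null
]
clean, quarantine = etl_filter(records)
```

In a real deployment the scorer would be a model trained offline and loaded into the pipeline, but the routing pattern (score, threshold, quarantine) stays the same.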
Dive into the world of scalable data processing with the "Apache Spark Quick Start Guide." This book offers a foundational introduction to Spark, empowering readers to harness its capabilities for big data processing. With clear explanations and hands-on examples, you’ll learn to implement Spark applications that handle complex data tasks efficiently.
What this book will help me do
Understand and implement Spark’s RDD and DataFrame APIs to process large datasets effectively.
Set up a local development environment for Spark-based projects.
Develop skills to debug and optimize slow-performing Spark applications.
Harness Spark’s built-in modules for SQL, streaming, and machine learning applications.
Adopt best practices and optimization techniques for high-performance Spark applications.
Author(s)
Shrey Mehrotra is a seasoned software developer with expertise in big data technologies, particularly Apache Spark. With years of hands-on industry experience, Shrey focuses on making complex technical concepts accessible to all. Through his writing, he aims to share clear, practical guidance for developers of all levels.
Who is it for?
This guide is perfect for big data enthusiasts and professionals looking to learn Apache Spark’s capabilities from scratch. It is aimed at data engineers interested in optimizing application performance and data scientists wanting to integrate machine learning with Spark. A basic familiarity with Scala, Python, or Java is recommended.
Summary
The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time oriented events.
Introduction
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m welcoming Ajay Kulkarni and Mike Freedman back to talk about how TimescaleDB has grown and changed over the past year.
Interview
Introduction
How did you get involved in the area of data management?
Can you refresh our memory about what TimescaleDB is?
How has the market for timeseries databases changed since we last spoke?
What has changed in the focus and features of the TimescaleDB project and company?
Toward the end of 2018 you launched the 1.0 release of Timescale. What were your criteria for establishing that milestone?
What were the most challenging aspects of reaching that goal?
In terms of timeseries workloads, what are some of the factors that differ across varying use cases?
How do those differences impact the ways in which Timescale is used by the end user, and built by your team?
What are some of the initial assumptions that you made while first launching Timescale that have held true, and which have been disproven?
How have the improvements and new features in the recent releases of PostgreSQL impacted the Timescale product?
Have you been able to leverage some of the native improvements to simplify your implementation?
Are there any use cases for Timescale that would have previously been impractical in vanilla Postgres but would now be reasonable without the help of Timescale?
What is in store for the future of the Timescale product and organization?
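The timeseries workload questions above often come down to one operation: aggregating raw points into fixed time buckets. TimescaleDB exposes a time_bucket function in Postgres for exactly this; purely to illustrate the concept, the same downsampling can be sketched with plain SQL in SQLite (the table name, 60-second window, and sample data here are invented):

```python
import sqlite3

# Epoch-second timestamps bucketed into 60-second windows, the same idea
# as TimescaleDB's time_bucket() but expressed with integer arithmetic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (ts INTEGER, device TEXT, value REAL)")
rows = [(0, "a", 1.0), (30, "a", 3.0), (61, "a", 10.0), (90, "a", 20.0)]
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)

buckets = conn.execute(
    """
    SELECT (ts / 60) * 60 AS bucket, device, AVG(value)
    FROM metrics
    GROUP BY bucket, device
    ORDER BY bucket
    """
).fetchall()
# bucket 0 averages 1.0 and 3.0; bucket 60 averages 10.0 and 20.0
```

Timescale's hypertables add automatic partitioning, compression, and continuous aggregates on top of this basic pattern, which is where the engineering discussed in the episode comes in.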
Contact Info
Ajay
@acoustik on Twitter LinkedIn
Mike
LinkedIn
Website
@michaelfreedman on Twitter
Timescale
Website
Documentation
Careers
timescaledb on GitHub
@timescaledb on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
TimescaleDB
Original Appearance on the Data Engineering Podcast
1.0 Release Blog Post
PostgreSQL
Podcast Interview
RDS
DB-Engines
MongoDB
IOT (Internet Of Things)
AWS Timestream
Kafka
Pulsar
Podcast Episode
Spark
Podcast Episode
Flink
Podcast Episode
Hadoop
DevOps
PipelineDB
Podcast Interview
Grafana
Tableau
Prometheus
OLTP (Online Transaction Processing)
Oracle DB
Data Lake
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Summary
As more companies and organizations work to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fulfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different from those of batch-oriented workloads. To address this shortcoming the team at Dell EMC has created the open source Pravega project. In this episode Tom Kaitchuck explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly-once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the host’s mind is blown by the possibilities of treating everything, including schema information, as a stream.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Tom Kaitchuck about Pravega, an open source data storage platform optimized for persistent streams.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Pravega is and the story behind it?
What are the use cases for Pravega and how does it fit into the data ecosystem?
How does it compare with systems such as Kafka and Pulsar for ingesting and persisting unbounded data?
How do you represent a stream on-disk?
What are the benefits of using this format for persisted streams?
One of the compelling aspects of Pravega is the automatic sharding and resource allocation for variations in data patterns. Can you describe how that operates and the benefits that it provides?
I am also intrigued by the automatic tiering of the persisted storage. How does that work and what options exist for managing the lifecycle of the data in the cluster?
For someone who wants to build an application on top of Pravega, what interfaces does it provide and what architectural patterns does it lend itself toward?
What are some of the unique system design patterns that are made possible by Pravega?
How is Pravega architected internally?
What is involved in integrating engines such as Spark, Flink, or Storm with Pravega?
A common challenge for streaming systems is exactly-once semantics. How does Pravega approach that problem?
Does it have any special capabilities for simplifying processing of out-of-order events?
For someone planning a deployment of Pravega, what is involved in building and scaling a cluster?
What are some of the operational edge cases that users should be aware of?
What are some of the most interesting, useful, or challenging experiences that you have had while building Pravega?
What are some cases where you would recommend against using Pravega?
What is in store for the future of Pravega?
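The exactly-once question above is usually answered with some combination of transactions and deduplication. Pravega's actual mechanism involves transactional writers and server-side state; purely as a conceptual sketch of why replayed events can be made harmless, the core idea looks like this in plain Python (the event IDs and amounts are invented):

```python
class IdempotentSink:
    """Conceptual sketch of exactly-once delivery: the sink remembers which
    event IDs it has already applied, so replays after a failure have no
    effect. (Pravega itself uses transactions and tracked writer state;
    this only shows the underlying idea.)"""

    def __init__(self):
        self.applied_ids = set()
        self.total = 0

    def apply(self, event_id, amount):
        if event_id in self.applied_ids:  # duplicate from a retry/replay
            return False
        self.applied_ids.add(event_id)
        self.total += amount
        return True

sink = IdempotentSink()
# At-least-once delivery may replay event "e2" after a crash:
stream = [("e1", 5), ("e2", 7), ("e2", 7), ("e3", 1)]
for eid, amt in stream:
    sink.apply(eid, amt)
# total ends at 13, as if each event were processed exactly once
```

Transactions generalize this: a batch of events becomes visible atomically or not at all, so a failed writer can safely retry the whole batch.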
Contact Info
tkaitchuk on GitHub LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Dive into the world of Apache Kafka with this concise guide that focuses on its practical use for real-time data processing in distributed systems. You’ll explore Kafka’s capabilities, covering essentials like configuration, messaging, serialization, and handling complex data streams using Kafka Streams and KSQL. By the end, you’ll be equipped to tackle real-world streaming challenges confidently.
What this book will help me do
Understand how to set up and configure Apache Kafka for real-time processing environments.
Master key concepts like message validation, enrichment, and serialization.
Learn to use the Schema Registry for data validation and versioning.
Gain hands-on experience with data streaming and aggregation using Kafka Streams.
Develop skills in using KSQL for data manipulation and stream querying.
Author(s)
Estrada is an experienced software engineer with a deep understanding of distributed systems and real-time data processing. With expertise in Apache Kafka and other event-streaming platforms, Estrada approaches technical writing with an emphasis on clarity and practical application. Their passion for helping developers achieve success is reflected in their authoritative yet approachable style.
Who is it for?
This book is perfect for software engineers and backend developers interested in mastering real-time data processing using Apache Kafka. It is designed for readers who are eager to solve practical problems in distributed systems, irrespective of whether they have prior Kafka experience. Some familiarity with Java or other JVM languages will be helpful, although not strictly necessary. This is an ideal resource for learners seeking a hands-on, practical approach to Apache Kafka.
Why have stream-oriented data systems become so popular, when batch-oriented systems have served big data needs for many years? In the updated edition of this report, Dean Wampler examines the rise of streaming systems for handling time-sensitive problems—such as detecting fraudulent financial activity as it happens. You’ll explore the characteristics of fast data architectures, along with several open source tools for implementing them. Batch processing isn’t going away, but exclusive use of these systems is now a competitive disadvantage. You’ll learn that, while fast data architectures using tools such as Kafka, Akka, Spark, and Flink are much harder to build, they represent the state of the art for dealing with mountains of data that require immediate attention.
Learn how a basic fast data architecture works, step by step
Examine how Kafka’s data backplane combines the best abstractions of log-oriented and message queue systems for integrating components
Evaluate four streaming engines, including Kafka Streams, Akka Streams, Spark, and Flink
Learn which streaming engines work best for different use cases
Get recommendations for making real-world streaming systems responsive, resilient, elastic, and message driven
Explore an example IoT streaming application that includes telemetry ingestion and anomaly detection
In this episode, Daniel Graham dissects the capabilities of data lakes and compares them to data warehouses. He talks about the primary use cases of data lakes and how they are vital for big data ecosystems. He then goes on to explain the role of data warehouses, which are still responsible for timely and accurate data but no longer play a central role. In the end, both Wayne Eckerson and Dan Graham settle on a common definition for modern data architectures.
Daniel Graham has more than 30 years in IT, consulting, research, and product marketing, with almost 30 years at leading database management companies. Dan was a Strategy Director in IBM’s Global BI Solutions division and General Manager of Teradata’s high-end server divisions. During his tenure as a product marketer, Dan has been responsible for MPP data management systems, data warehouses, and data lakes, and most recently, the Internet of Things and streaming systems.
Build efficient data flow and machine learning programs with this flexible, multi-functional open-source cluster-computing framework.
Key Features
Master the art of real-time big data processing and machine learning
Explore a wide range of use cases to analyze large data
Discover ways to optimize your work by using many features of Spark 2.x and Scala
Book Description
Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionalities such as big data processing, analytics, machine learning, and more. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to expand Spark’s functionality and building your own data flow and machine learning programs on this platform. You will work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and datasets, implementing streaming analytics with Spark Streaming, and applying machine learning and deep learning techniques on Spark using MLlib and various external tools. By the end of this elaborately designed Learning Path, you will have all the knowledge you need to master Apache Spark, and build your own big data processing and analytics pipeline quickly and without any hassle.
This Learning Path includes content from the following Packt products: Mastering Apache Spark 2.x by Romeo Kienzler; Scala and Spark for Big Data Analytics by Md. Rezaul Karim and Sridhar Alla; and Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei.
What you will learn
Get to grips with all the features of Apache Spark 2.x
Perform highly optimized real-time big data processing
Use ML and DL techniques with Spark MLlib and third-party tools
Analyze structured and unstructured data using SparkSQL and GraphX
Understand tuning, debugging, and monitoring of big data applications
Build scalable and fault-tolerant streaming applications
Develop scalable recommendation engines
Who this book is for
If you are an intermediate-level Spark developer looking to master the advanced capabilities and use cases of Apache Spark 2.x, this Learning Path is ideal for you. Big data professionals who want to learn how to integrate and use the features of Apache Spark and build a strong big data pipeline will also find this Learning Path useful. To grasp the concepts explained in this Learning Path, you must know the fundamentals of Apache Spark and Scala.
Work with Apache Spark using Scala to deploy and set up single-node, multi-node, and high-availability clusters. This book discusses various components of Spark such as Spark Core, DataFrames, Datasets and SQL, Spark Streaming, Spark MLlib, and R on Spark with the help of practical code snippets for each topic. Practical Apache Spark also covers the integration of Apache Spark with Kafka with examples. You’ll follow a learn-to-do-by-yourself approach to learning – learn the concepts, practice the code snippets in Scala, and complete the assignments given to get an overall exposure. On completion, you’ll have knowledge of the functional programming aspects of Scala, and hands-on expertise in various Spark components. You’ll also become familiar with machine learning algorithms with real-time usage.
What You Will Learn
Discover the functional programming features of Scala
Understand the complete architecture of Spark and its components
Integrate Apache Spark with Hive and Kafka
Use Spark SQL, DataFrames, and Datasets to process data using traditional SQL queries
Work with different machine learning concepts and libraries using Spark’s MLlib packages
Who This Book Is For
Developers and professionals who deal with batch and stream data processing.
Summary
Apache Spark is a popular and widely used tool for a variety of data-oriented projects. With its large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book to help data engineers hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Spark is?
What are some of the main use cases for Spark?
What are some of the problems that Spark is uniquely suited to address?
Who uses Spark?
What are the tools offered to Spark users?
How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?
For someone building on top of Spark what are the main software design paradigms?
How does the design of an application change as you go from a local development environment to a production cluster?
Once your application is written, what is involved in deploying it to a production environment?
What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?
What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?
What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?
What are the limitations of the Spark programming model?
What are the cases where Spark is the wrong choice?
What was your motivation for writing a book about Spark?
Who is the target audience?
What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark?
What advice do you have for anyone who is considering or currently using Spark?
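A recurring theme in the questions above is Spark's programming model: you build up a lazy graph of transformations, and nothing executes until an action runs. Purely as a rough illustration (this is a toy pure-Python stand-in, not Spark's actual API), the shape of that model looks like:

```python
import functools

class MiniRDD:
    """A toy stand-in for Spark's RDD model: transformations (map, filter)
    are only recorded; nothing runs until an action (collect, reduce)."""

    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []

    def map(self, fn):
        return MiniRDD(self.data, self.ops + [("map", fn)])

    def filter(self, fn):
        return MiniRDD(self.data, self.ops + [("filter", fn)])

    def collect(self):
        # The recorded pipeline is applied lazily, element by element.
        out = iter(self.data)
        for kind, fn in self.ops:
            out = map(fn, out) if kind == "map" else filter(fn, out)
        return list(out)

    def reduce(self, fn):
        return functools.reduce(fn, self.collect())

rdd = MiniRDD(["spark", "flink", "kafka", "spark"])
lengths = rdd.filter(lambda w: w.startswith("s")).map(len)
# No work has happened yet; calling the action triggers the pipeline:
result = lengths.collect()  # [5, 5]
```

Real Spark adds partitioning, shuffles, fault tolerance, and an optimizer on top, but deferring execution until an action is requested is the core design choice that makes those optimizations possible.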
Contact Info
@jgperrin on Twitter Blog
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Book Discount
Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com
Links
Apache Spark
Spark In Action
Book code examples in GitHub
Informix
International Informix Users Group
MySQL
Microsoft SQL Server
ETL (Extract, Transform, Load)
Spark SQL and Spark In Action‘s chapter 11
Spark ML and Spark In Action‘s chapter 18
Spark Streaming (structured) and Spark In Action‘s chapter 10
Spark GraphX
Hadoop
Jupyter
Podcast Interview
Zeppelin
Databricks
IBM Watson Studio
Kafka
Flink
Summary
Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Flink is and how the project got started?
What are some of the primary ways that Flink is used?
How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?
What are some use cases that Flink is uniquely qualified to handle?
Where does Flink fit into the current data landscape?
How is Flink architected?
How has that architecture evolved?
Are there any aspects of the current design that you would do differently if you started over today?
How does scaling work in a Flink deployment?
What are the scaling limits?
What are some of the failure modes that users should be aware of?
How is the statefulness of a cluster managed?
What are the mechanisms for managing conflicts?
What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose?
Can state be shared across processes or tasks within a Flink cluster?
What are the comparative challenges of working with bounded vs unbounded streams of data?
How do you handle out of order events in Flink, especially as the delay for a given event increases?
For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it?
What are some of the most challenging or complicated aspects of building and maintaining Flink?
What are some of the most interesting or unexpected ways that you have seen Flink used?
What are some of the improvements or new features that are planned for the future of Flink?
What are some features or use cases that you are explicitly not planning to support?
For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?
What do they find most interesting or exciting?
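The out-of-order question above is where Flink's event-time semantics and watermarks come in. Purely as a conceptual sketch (Flink's real API is much richer; the window size, lateness bound, and event times here are invented), a tumbling window that waits for late events before emitting its result can be modeled in a few lines of plain Python:

```python
from collections import defaultdict

def tumbling_window_counts(events, window=10, max_lateness=5):
    """Sketch of event-time windowing over out-of-order input: a window is
    emitted only once the watermark (max seen event time minus the allowed
    lateness) has passed its end, so late-but-tolerated events still count."""
    windows = defaultdict(int)  # window start -> event count
    emitted = {}
    watermark = float("-inf")
    for event_time in events:
        windows[(event_time // window) * window] += 1
        watermark = max(watermark, event_time - max_lateness)
        # Emit every window that can no longer receive late events.
        for start in [s for s in windows if s + window <= watermark]:
            emitted[start] = windows.pop(start)
    emitted.update(windows)  # flush whatever remains at end of input
    return emitted

# Event time 3 arrives after 12, but still lands in the [0, 10) window:
counts = tumbling_window_counts([1, 7, 12, 3, 18, 21])
# {0: 3, 10: 2, 20: 1}
```

Flink generalizes this with pluggable watermark strategies, allowed lateness, and side outputs for events that arrive after a window has already fired.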
Contact Info
LinkedIn
@fhueske on Twitter
fhueske on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Flink
Data Artisans
IBM DB2
Technische Universität Berlin
Hadoop
Relational Database
Google Cloud Dataflow
Spark
Cascading
Java
RocksDB
Flink Checkpoints
Flink Savepoints
Kafka
Pulsar
Storm
Scala
LINQ (Language INtegrated Query)
SQL
Backpressure
Summary
A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Upsolver is and how it got started?
What are your goals for the platform?
There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?
What are the shortcomings of a data lake architecture?
How is Upsolver architected?
How has that architecture changed over time?
How do you manage schema validation for incoming data?
What would you do differently if you were to start over today?
What are the biggest challenges at each of the major stages of the data lake?
What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake?
When is Upsolver the wrong choice for an organization considering implementation of a data platform?
Is there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house?
What features or improvements do you have planned for the future of Upsolver?
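One of the questions above concerns schema validation for incoming data, a core concern for any data lake where producers do not coordinate with consumers. Upsolver detects and evolves schemas automatically; purely to illustrate what validating one streaming record involves, here is a minimal hand-rolled check in Python (the schema, field names, and sample records are all made up):

```python
# Hypothetical expected schema for a stream of click events.
EXPECTED_SCHEMA = {"user_id": int, "event": str, "ts": float}

def validate(record, schema=EXPECTED_SCHEMA):
    """Return a list of schema violations for one incoming record."""
    errors = []
    for field, ftype in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

good = {"user_id": 1, "event": "click", "ts": 17.0}
bad = {"user_id": "1", "event": "click"}  # wrong type, and ts is missing
good_errors = validate(good)  # no violations
bad_errors = validate(bad)    # two violations
```

A managed platform extends this in two directions the sketch ignores: inferring the schema from the data itself, and deciding what to do when it drifts (quarantine, cast, or evolve the table).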
Contact Info
Yoni
yoniiny on GitHub LinkedIn
Upsolver
Website
@upsolver on Twitter
LinkedIn
Facebook
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Upsolver
Data Lake
Israeli Army
Data Warehouse
Data Engineering Podcast Episode About Data Curation
Three Vs
Kafka
Spark
Presto
Drill
Spot Instances
Object Storage
Cassandra
Redis
Latency
Avro
Parquet
ORC
Data Engineering Podcast Episode About Data Serialization Formats
SSTables
Run Length Encoding
CSV (Comma Separated Values)
Protocol Buffers
Kinesis
ETL
DevOps
Prometheus
Cloudwatch
DataDog
InfluxDB
SQL
Pandas
Confluent KSQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast