Highlights

It’s official: We’ve launched 6MO, our first-ever Global Music Industry Data Report! We’re thrilled to present you with our comprehensive view, from a music data perspective, of the first six months of 2019. Dig in to Part 1 with us here.

Mission

Good morning, it’s Rutger here at Chartmetric with your 3-minute Data Dump, where we upload charts, artists, and playlists into your brain so you can stay up on the latest in the music data world. We’re on the socials at “chartmetric” (that’s Chartmetric, no “S”). Follow us on LinkedIn, Instagram, Twitter, or Facebook, and talk to us! We’d love to hear from you.

Date

This is your Data Dump for Wednesday, Oct. 2, 2019.

6MO Global Music Industry Data Report, Part 1: Semi-Annual Awards

If you haven’t heard yet, we officially released our first-ever Global Music Industry Data Report on Tuesday, and the response has us very excited to dive into it with you here. Last week, we explained the 30-page structure: Semi-Annual Awards, Platform-Playlist Analysis, and Strategic Business Insights. Today, we’re tackling Part 1, our Chartmetric Semi-Annual Awards, which rank the top-performing artists by absolute and percentage-based growth across multiple metrics as of June 30, 2019, the last day of the six-month period we tracked. By the way, if you’ve got the report in hand, feel free to scroll or flip along with us.

First off, our Cross-Platform Performance Award, as you might imagine, revealed some familiar names in the Top 10 in terms of overall streaming and social popularity: from T. Swift to Shawn Mendes and Rihanna to Justin Bieber and Ariana Grande. However, the interesting stories were J Balvin at No. 2 and Daddy Yankee at No. 7, reflecting Latin music’s growth outside of Latin America itself, and the late Avicii at No. 10, likely due to his strong catalog consistently driving 3M+ YouTube views daily, his April release of “SOS” with Aloe Blacc, and the full posthumous album release of Tim on June 6.

When it came to YouTube Channel Views gained as of June 25, 2019, six of the Top 10 artists with the highest gains were primarily Spanish-speaking, showcasing both the strength of Latin content and the popularity of the YouTube platform with Latin audiences. Keep in mind, however, that India-specific music charts didn’t launch until two weeks ago, so that data could very well change the distribution in a big way. Stay tuned for our July-to-December report to see if the next six months prove that to be the case!

For Spotify Monthly Listener gains as of June 30, 2019, collaborations were crucial to Lunay’s 557 percent and Jhay Cortez’s 521 percent lifts, not to mention Billy Ray Cyrus’ 3,032 percent increase as a result of his “Old Town Road” collab with Lil Nas X. On Twitter, follower gains were all about diversity, with three Korean groups, three Americans, two Brazilians, one Nigerian, and one Turkish rocker comprising the Top 10 percentage gains. And on our own platform, BTS won out on the Artist Follower front, and Spotify curators dominated in terms of Playlist Followers.

It would be an understatement to say that this is just the tip of the iceberg for Part 1, so please keep digging into it, and let us know what else you find! Next up, we’re taking on Part 2, our Platform-Playlist Analysis, where we break down artist country market share and artist genre market share across Amazon, Apple, Deezer, and Spotify’s top 30 playlists. So stay tuned for that!

Outro

That’s it for your Daily Data Dump for Wednesday, Oct. 2, 2019. This is Rutger from Chartmetric. Free accounts are available at chartmetric.com, and article links and show notes are at podcast.chartmetric.com. By the way, if you haven’t downloaded our report yet, you can find it all across our socials and in our show notes! Happy Wednesday, and we’ll see you on Friday for Part 2!
Summary
The Hadoop platform is purpose-built for processing large, slow-moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast analytics on fast-moving data. To fill this need the Kudu project was created, with a column-oriented table format tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into your analytics pipeline.
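To give a concrete feel for the write path discussed in the episode, here is a minimal sketch using the kudu-python client. The master address kudu-master:7051, the table name, and the columns are all hypothetical; in production you would more often pair Kudu with Impala or Spark SQL, as the interview covers.

```python
import kudu
from kudu.client import Partitioning

# Connect to a (hypothetical) Kudu master.
client = kudu.connect(host='kudu-master', port=7051)

# Define a schema: Kudu tables are strongly typed with an explicit primary key.
builder = kudu.schema_builder()
builder.add_column('key').type(kudu.int64).nullable(False).primary_key()
builder.add_column('metric').type(kudu.string).nullable(False)
builder.add_column('value').type(kudu.double)
schema = builder.build()

# Hash-partition rows across tablets for parallel writes and scans.
partitioning = Partitioning().add_hash_partitions(column_names=['key'],
                                                  num_buckets=3)
client.create_table('metrics_example', schema, partitioning)

# Writes go through a session, which batches operations until flushed.
table = client.table('metrics_example')
session = client.new_session()
session.apply(table.new_insert({'key': 1, 'metric': 'cpu', 'value': 0.5}))
session.flush()

# Scans read the same data back with low latency.
scanner = table.scanner()
scanner.open()
print(scanner.read_all_tuples())
```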
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Brock Noland and Jordan Birdsell about Apache Kudu and how it is able to provide fast analytics on fast data in the Hadoop ecosystem
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Kudu is and the motivation for building it?
How does it fit into the Hadoop ecosystem?
How does it compare to the work being done on the Iceberg table format?
What are some of the common application and system design patterns that Kudu supports?
How is Kudu architected and how has it evolved over the life of the project?
There are many projects in and around the Hadoop ecosystem that rely on ZooKeeper as a building block for consensus. What was the reasoning for using Raft in Kudu?
How does the storage layer in Kudu differ from what would be found in systems like Hive or HBase?
What are the implementation details in the Kudu storage interface that have had the greatest impact on its overall speed and performance?
A number of the projects built for large-scale data processing were not initially built with a focus on operational simplicity. What are the features of Kudu that simplify deployment and management of production infrastructure?
What was the motivation for using C++ as the language target for Kudu?
If you were to start the project over today, what would you do differently?
What are some situations where you would advise against using Kudu?
What have you found to be the most interesting/unexpected/challenging lessons learned in the process of building and maintaining Kudu?
What are you most excited about for the future of Kudu?
Contact Info
Brock
LinkedIn, @brocknoland on Twitter
Jordan
LinkedIn, @jordanbirdsell, jbirdsell on GitHub
PhData
Website, phdata on GitHub, @phdatainc on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Kudu, PhData, Getting Started with Apache Kudu, Thomson Reuters, Hadoop, Oracle Exadata, Slowly Changing Dimensions, HDFS, S3, Azure Blob Storage, State Farm, Stanley Black & Decker, ETL (Extract, Transform, Load), Parquet
Podcast Episode
ORC, HBase, Spark
Podcast Episode
Summary
With the growth of the Hadoop ecosystem came a proliferation of implementations of the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built on the assumptions of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.
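For readers who want to see what this looks like in practice, below is a minimal sketch of using Iceberg from Spark. It assumes the iceberg-spark-runtime package is on Spark’s classpath, and the catalog name demo, the warehouse bucket, and the table are hypothetical. Note the days(ts) partition transform: Iceberg’s hidden partitioning keeps the layout in metadata, so readers filter on the raw timestamp and never need to know how the table is partitioned, which is part of the metadata-layer story Ryan describes.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime jar is available to this Spark session.
spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3a://example-bucket/warehouse")
    .getOrCreate()
)

# Hidden partitioning: the table tracks days(ts) in metadata, so no
# partition column leaks into the schema that queries must know about.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp(), 'hello')")

# Filters on ts are pruned to matching daily partitions automatically.
spark.sql("""
    SELECT * FROM demo.db.events
    WHERE ts > date_sub(current_date(), 7)
""").show()
```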
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Iceberg is and the motivation for creating it?
Was the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?
How has the use of Iceberg simplified your work at Netflix?
How is the reference implementation architected and how has it evolved since you first began work on it?
What is involved in deploying it to a user’s environment?
For someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?
Is there a migration path for pre-existing tables into the Iceberg format?
How is schema evolution managed at the file level?
How do you handle files on disk that don’t contain all of the fields specified in a table definition?
One of the complicated problems in data modeling is managing table partitions. How does Iceberg help in that regard?
What are the unique challenges posed by using S3 as the basis for a data lake?
What are the benefits that outweigh the difficulties?
What have been some of the most challenging or contentious details of the specification to define?
What are some things that you have explicitly left out of the specification?
What are your long-term goals for the Iceberg specification?
Do you anticipate the reference implementation continuing to be used and maintained?
Contact Info
rdblue on GitHub, LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Iceberg Reference Implementation, Iceberg Table Specification, Netflix, Hadoop, Cloudera, Avro, Parquet, Spark, S3, HDFS, Hive, ORC, S3mper, Git, Metacat, Presto, Pig, DDL (Data Definition Language), Cost-Based Optimization
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA. Support Data Engineering Podcast.
Summary
Collaboration, distribution, and installation of software projects are largely solved problems, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to make data as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today.
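As a rough illustration of the "GitHub for data" workflow described here, below is a sketch using the modern quilt3 Python client. The package name alice/iris, the registry bucket, and the file paths are hypothetical, and the API has evolved since this episode was recorded, so treat this as a sketch of the workflow rather than a record of what is discussed in the interview.

```python
import quilt3

# Build a versioned data package from local files (names are hypothetical).
pkg = quilt3.Package()
pkg.set("data/iris.csv", "local/iris.csv")    # logical key -> physical file
pkg.set_meta({"source": "uci", "rows": 150})  # package-level metadata

# Push to a registry (an S3 bucket); each push records a new revision,
# much like a commit.
pkg.push("alice/iris",
         registry="s3://example-quilt-bucket",
         message="initial version")

# A collaborator can browse the package and fetch only what they need.
remote = quilt3.Package.browse("alice/iris",
                               registry="s3://example-quilt-bucket")
remote["data/iris.csv"].fetch("downloaded/iris.csv")
```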
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Kevin Moore about Quilt Data, a platform and tooling for packaging, distributing, and versioning data
Interview
Introduction
How did you get involved in the area of data management?
What is the intended use case for Quilt and how did the project get started?
Can you step through a typical workflow of someone using Quilt?
How does that change as you go from a single user to a team of data engineers and data scientists?
Can you describe the elements of what a data package consists of?
What were your criteria for the file formats that you chose?
How is Quilt architected and what have been the most significant changes or evolutions since you first started?
How is the data registry implemented?
What are the limitations or edge cases that you have run into?
What optimizations have you made to accelerate synchronization of the data to and from the repository?
What are the limitations in terms of data volume, format, or usage?
What is your goal with the business that you have built around the project?
What are your plans for the future of Quilt?
Contact Info
Email, LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Quilt Data, GitHub, Jobs, Reproducible Data Dependencies in Jupyter, Reproducible Machine Learning with Jupyter and Quilt, Allen Institute: Programmatic Data Access with Quilt, Quilt Example: MissingNo, Oracle, Pandas, Jupyter, Y Combinator, Data.World
Podcast Episode with CTO Bryon Jacob
Kaggle, Parquet, HDF5, Arrow, PySpark, Excel, Scala, Binder, Merkle Tree, Allen Institute for Cell Science, Flask, PostgreSQL, Docker, Airflow, Quilt Teams, Hive, Hive Metastore, PrestoDB
Podcast Episode
Netflix Iceberg, Kubernetes, Helm
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA. Support Data Engineering Podcast.
In a few short years, e-business has gone from a simple concept to an undeniable reality, and for good reason. It works for everyone: consumers, businesses, and governments. The primary values of e-business, such as cost savings, revenue growth, and customer satisfaction, are proving to be only the tip of the iceberg. Having realized the benefit of Web-enabling individual business processes, many companies now seek further return on investment (ROI) by integrating new and existing e-business applications and technologies. The key to their success is to find a way to give customers what they want without the expense of traditional business operations. This IBM Redbook explains the IBM approach to creating e-business solutions. It targets IT specialists and architects who want to learn about proven technologies, products, and solutions for building advanced e-business applications, as well as technical professionals who are planning to take IBM Certification Test 815, IBM e-business Solution Design (a revision of Test 811, Designing IBM e-business Solutions). Written by the same people who created Test 815, this publication is a guide to the style and thinking that went into each and every test question, and it includes helpful tips for taking the test along with sample questions.
With Eon on Azure, backups don’t just sit idle—they become a first-class data source. Eon transforms cloud backups into Iceberg tables in Blob Storage, instantly queryable through Microsoft Fabric and OneLake. Learn how backup data flows into Fabric engines like SQL, Spark, and KQL, and how it fuels AI innovation with Azure OpenAI. See how organizations can collaborate more effectively by unifying protection, analytics, and AI on Eon’s data lake.
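To make the "instantly queryable" claim concrete, here is a rough sketch of what reading such Iceberg tables from a Spark session might look like. Everything here is an assumption for illustration: the catalog name eon, the storage account and container, and the table name are hypothetical, and a real Fabric deployment would likely wire the catalog up differently (for example via OneLake shortcuts or a REST catalog) rather than the plain Hadoop catalog shown.

```python
from pyspark.sql import SparkSession

# Hypothetical: point an Iceberg catalog at the Blob Storage warehouse where
# backup data has been landed as Iceberg tables (requires the Iceberg Spark
# runtime and Azure filesystem connectors on the classpath).
spark = (
    SparkSession.builder
    .appName("backup-analytics-sketch")
    .config("spark.sql.catalog.eon", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.eon.type", "hadoop")
    .config("spark.sql.catalog.eon.warehouse",
            "abfss://backups@examplestorage.dfs.core.windows.net/warehouse")
    .getOrCreate()
)

# Once cataloged, backup snapshots are ordinary tables for SQL analytics.
spark.sql("""
    SELECT database_name, COUNT(*) AS row_count
    FROM eon.backups.snapshot_inventory
    GROUP BY database_name
""").show()
```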