Improving the user experience by analysing user listening history data to optimize a recommender system feature for a music streaming service provider
Music streaming services let people listen to many types of music and millions of tracks on their smart devices according to their preferences; these features have made listening to music more accessible than ever (Adiyansjan, Gunawan, & Suhartono, 2019). With a growing number of competitors in the music streaming industry, a good user experience is essential for increasing competitive advantage and strengthening customer stickiness. In this project, we derived actionable suggestions for improving Deezer's music recommender system by analyzing users' listening history data and their experience with the recommender system.
- Background
- Problem
- Solution
- Data Source
- Methods
- Data Analysis
- Key Findings
- Conclusion
- Project Reflection
- Notebook
- Reference
Deezer is a French music streaming service provider founded in 2006. It provides 73 million tracks and customized features based on subscription type. In addition, Deezer offers non-personalized recommendations based on common interests, which filter users' preferences and listening history. In 2016, Deezer introduced an exclusive feature, Flow, an optimized recommendation system based on the user's mood. According to the company, this feature recommends new or previously played tracks to users based on their listening history, context and time. In other words, users can listen to music that suits different moods, contexts or specific events.
Users nowadays are exposed to a vast amount of information and face the paradox of choice: an abundance of choices can delay users' decisions and reduce their motivation to stay with a service (Maasø & Hagen, 2020). Hence, for businesses that wish to increase competitive advantage and enhance user stickiness towards digital products, it is essential to develop a recommender system, a mechanism that automatically suggests media meeting users' expectations (Hansen et al., 2021).
The challenge was posed on Kaggle: predict whether the users in the test dataset listened to the first track that Flow proposed to them. Deezer considers a track "listened" if the user has listened to more than 30 seconds of it (is_listened = 1). If the user presses the skip button to change the song before 30 seconds, the track is not considered listened (is_listened = 0).
The original goal of this Kaggle challenge is to improve the recommender system so that it can accurately predict and suggest a track the user will listen to for more than 30 seconds. However, a positive user experience is also crucial when measuring product performance. So, besides predicting whether a user would skip a song, in this project I conducted a user preference analysis, drawing on user age, user activity, music preference and listening patterns, to generate insights into the user experience and optimize Deezer's recommendation system.
The data originates from a Kaggle challenge.
The target variable of this dataset is is_listened. There are 7,558,834 observations with 14 predictors.
- `genre_id`: ID of the genre of the song
- `media_id`: ID of the song listened to by the user
- `album_id`: ID of the album of the song
- `media_duration`: duration of the song
- `user_gender`: gender of the user
- `user_id`: user ID
- `context_type`: type of content where the song was listened to: playlist, album ...
- `release_date`: release date of the song in the format YYYYMMDD
- `ts_listen`: timestamp of the listening in UNIX time
- `platform_name`: type of operating system
- `platform_family`: type of device
- `user_age`: age of the user
- `listen_type`: whether the song was listened to in Flow or not
- `artist_id`: ID of the artist of the song
- `is_listened`: 1 refers to a track that has been listened to, 0 otherwise
- Data preprocessing
- Data Exploration
- Feature Engineering and Data Analysis
After the data exploration, we found three main issues in the train dataset:
- There are 17 entries where `release_date` is 30000101, which cannot be recognized as a valid date
- There are 29,779 entries where `release_date` is later than `ts_listen`, that is, the track was played before its recorded release date
- There are 2 records where `ts_listen` is earlier than the year Deezer was founded (2006)
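The three checks above can be sketched in pandas as follows. This is a minimal illustration, not the notebook's actual code: it assumes `release_date` is stored as a `YYYYMMDD` integer and `ts_listen` as UNIX seconds (as in the data dictionary), reads the second issue as plays timestamped before the recorded release date, and uses a hypothetical helper name, `flag_date_issues`. Dates are compared as integers so that the out-of-range value 30000101 never has to be parsed as a timestamp.

```python
import pandas as pd


def flag_date_issues(df: pd.DataFrame, founded_year: int = 2006) -> pd.DataFrame:
    """Add one boolean column per data-quality issue (hypothetical helper)."""
    out = df.copy()
    listen_dt = pd.to_datetime(out["ts_listen"], unit="s")  # interpreted as UTC
    # Encode the listening day as a YYYYMMDD integer, like release_date.
    listen_day = (
        listen_dt.dt.year * 10000 + listen_dt.dt.month * 100 + listen_dt.dt.day
    )
    # Issue 1: placeholder release date far in the future.
    out["bad_release_date"] = out["release_date"] == 30000101
    # Issue 2 (assumed reading): play recorded before the release date.
    # Rows with the placeholder date are flagged here as well.
    out["played_before_release"] = listen_day < out["release_date"]
    # Issue 3: play recorded before the service existed.
    out["played_before_founding"] = listen_dt.dt.year < founded_year
    return out


# Tiny made-up sample, not taken from the real dataset.
sample = pd.DataFrame(
    {
        "ts_listen": [1480597215, 1480544735, 946684800],
        "release_date": [20040704, 30000101, 19990101],
    }
)
flags = flag_date_issues(sample)
print(flags[["bad_release_date", "played_before_release", "played_before_founding"]].sum())
```

Keeping the flags as columns rather than dropping rows immediately makes it easy to count each issue first and decide afterwards whether to remove or repair the affected entries.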
To better understand user preferences, behaviours and listening patterns, a series of feature engineering steps was carried out.
- Time-related features: year, month, day, weekday, is_weekend, hour, minutes and seconds were derived from `ts_listen`, which indicates the time a user starts listening to a track. Season and session were then derived from month and hour and labelled with four seasons and six time sessions.
- User-related features: user behaviour and listening patterns were created by aggregating `user_id`, `ts_listen`, `user_age`, `media_duration` and `media_id`.
  - `listen_diff`: how long the user listened to a track
  - `listen_percent`: the percentage of a song that was listened to
  - `time_gap`: the gap before the next listening session
  - `listen_start`: the time a user starts listening to music
  - `listen_end`: the time a user stops listening to music
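As a rough sketch of the feature engineering described above, the snippet below derives the time-related columns and one possible reading of the user-related ones. Several things here are assumptions rather than the notebook's definitions: the session cut points and season mapping, the interpretation of `listen_diff` as the time until the same user's next play capped at `media_duration`, and the helper names `add_time_features` and `add_user_features`. `listen_start` and `listen_end` are not covered.

```python
import pandas as pd

# Hypothetical cut points (start hour inclusive, end hour exclusive).
SESSION_BINS = [0, 5, 8, 12, 18, 21, 24]
SESSION_LABELS = ["midnight", "early morning", "morning", "afternoon", "evening", "night"]
# Northern-hemisphere meteorological seasons, also an assumption.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive calendar, season and session columns from ts_listen."""
    out = df.copy()
    dt = pd.to_datetime(out["ts_listen"], unit="s")  # interpreted as UTC
    out["year"] = dt.dt.year
    out["month"] = dt.dt.month
    out["day"] = dt.dt.day
    out["weekday"] = dt.dt.dayofweek  # Monday = 0
    out["is_weekend"] = out["weekday"] >= 5
    out["hour"] = dt.dt.hour
    out["season"] = out["month"].map(SEASON_BY_MONTH)
    out["session"] = pd.cut(
        out["hour"], bins=SESSION_BINS, labels=SESSION_LABELS, right=False
    )
    return out


def add_user_features(df: pd.DataFrame) -> pd.DataFrame:
    """Approximate per-play listening time from consecutive plays of a user."""
    out = df.sort_values(["user_id", "ts_listen"]).copy()
    next_ts = out.groupby("user_id")["ts_listen"].shift(-1)
    gap = next_ts - out["ts_listen"]  # seconds to next play; NaN for the last play
    # Assumed: a play cannot last longer than the track itself.
    out["listen_diff"] = gap.clip(upper=out["media_duration"])
    out["listen_percent"] = out["listen_diff"] / out["media_duration"]
    # Idle time between the end of this play and the start of the next one.
    out["time_gap"] = gap - out["listen_diff"]
    return out


# Tiny made-up sample, not taken from the real dataset.
sample = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2],
        "ts_listen": [1480597215, 1480597315, 1480597715, 1480544735],
        "media_duration": [200, 300, 250, 180],
    }
)
feats = add_user_features(add_time_features(sample))
print(feats[["user_id", "hour", "session", "listen_diff", "listen_percent", "time_gap"]])
```

Because the last play of each user has no following event, its `listen_diff`, `listen_percent` and `time_gap` stay missing in this sketch; how the original analysis handled those rows is not stated.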
More details can be found in the Deezer data analysis result.
Firstly, we take a quick look at the Flow feature, which is the column `listen_type`. `listen_type` indicates whether a user listened to music using Flow (listen_type = 1) or not (listen_type = 0). The attributes `user_id`, `user_age` and `media_id` (songs) were aggregated to calculate the average number of songs listened to per user and the percentage of songs listened to in each user age group.
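The aggregation described above might look like the following in pandas. This is only a sketch under stated assumptions: the helper names `songs_per_user_by_group` and `listened_share_by_group` are hypothetical, and the "percentage of songs listened to" is approximated here by the share of plays with `is_listened = 1`, which may differ from the measure used in the tables.

```python
import pandas as pd


def songs_per_user_by_group(df: pd.DataFrame, group_cols: list) -> pd.Series:
    """Average number of plays per user within each group."""
    per_user = (
        df.groupby(group_cols + ["user_id"])["media_id"]
        .count()
        .rename("n_songs")
        .reset_index()
    )
    return per_user.groupby(group_cols)["n_songs"].mean()


def listened_share_by_group(df: pd.DataFrame, group_cols: list) -> pd.Series:
    """Percentage of plays counted as listened (is_listened = 1) per group."""
    return df.groupby(group_cols)["is_listened"].mean() * 100


# Tiny made-up sample, not taken from the real dataset.
sample = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 3],
        "user_age": [19, 19, 19, 30, 30, 30],
        "listen_type": [1, 1, 0, 0, 0, 1],
        "media_id": [10, 11, 12, 10, 13, 14],
        "is_listened": [0, 1, 1, 1, 1, 0],
    }
)
avg_by_age = songs_per_user_by_group(sample, ["user_age"])
share_by_flow = listened_share_by_group(sample, ["listen_type"])
print(avg_by_age)
print(share_by_flow)
```

Passing `["user_age", "listen_type"]` as `group_cols` gives the same breakdown per age group with and without Flow.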
Table 1 shows the average length of the songs people listened to and the percentage of each song listened to, with and without the Flow function. It clearly shows that users who did not use Flow listened three times longer than users in Flow. More specifically, users who did not use Flow listened to nearly 60% of a song, whereas users who used Flow listened to less than 20% of a song recommended by the system.
Table 1. Average media listening percentage with and without Flow function
User age is added in Table 2 to compare user listening behaviour across ten age groups.
Table 2. Media listening duration
`listen_type` is added in Table 3 to compare users' listening behaviour across ten age groups with and without the Flow function.
Table 3. Media listening percentage with and without Flow function based on age group
- Time: Time is an essential factor that shifts user preferences. We divided the 24 hours of the day into six sessions: midnight, early morning, morning, afternoon, evening and night. The session graph shows that users started listening to music in the morning, reached a peak in the afternoon and then dropped off in the evening; the hour graph shows user activity across the 24 hours.
- Medium: The features `platform_family` and `platform_name` refer to the device and the operating system, respectively, that a user uses to access the Deezer app. Because the data is encoded with numeric values, we cannot tell which devices or operating systems users use; nonetheless, platform_family 0 and platform_name 0 are the most preferred mediums among users.
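The time and medium views above can be approximated with a couple of group-by operations. The sketch below is illustrative only: `active_users_per_hour` and `most_common_value` are hypothetical helper names, and the hour is taken from `ts_listen` interpreted as UTC, which is an assumption since the source does not say which time zone was used.

```python
import pandas as pd


def active_users_per_hour(df: pd.DataFrame) -> pd.Series:
    """Number of distinct users who started a play in each hour of the day."""
    hour = pd.to_datetime(df["ts_listen"], unit="s").dt.hour  # UTC assumed
    counts = df.groupby(hour)["user_id"].nunique()
    # Make sure all 24 hours appear, even those without any play.
    return counts.reindex(range(24), fill_value=0)


def most_common_value(df: pd.DataFrame, column: str):
    """Most frequent encoded value of a categorical column."""
    return df[column].value_counts().idxmax()


# Tiny made-up sample, not taken from the real dataset.
sample = pd.DataFrame(
    {
        "ts_listen": [1480597215, 1480597300, 1480597400, 1480544735],
        "user_id": [1, 2, 1, 3],
        "platform_family": [0, 0, 1, 0],
    }
)
hourly = active_users_per_hour(sample)
top_family = most_common_value(sample, "platform_family")
print(hourly[hourly > 0])
print(top_family)
```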
When it comes to content analysis, genre is one of the features that can vary over time and be influenced by the user's surroundings. We found that six main genres (genre_id 0, 7, 10, 25, 27 and 14) were very popular across all other attributes, such as hour, session, context, platform, listen type and user_age. In other words, regardless of the time, the user's age or the context, these six genres were favoured by users.

Key findings are listed below, and the graphical analysis can be seen in deezer_eda_result.
- Feature Flow
  - The number of songs, the length of songs and the percentage of each song listened to increased gradually with age.
  - Young users are more likely to skip songs than the 30-year-old age group.
  - Users aged 30 are more likely to finish songs recommended by the system.
  - Users aged 30 listened to nearly twice as many songs as users aged above 20.
  - Most users listening in Flow skipped more songs than users who were not in Flow, except users aged 19 and 30.
- User Behaviour
  - The number of active users increased dramatically between 5 am and 6 am.
  - The highest number of listeners occurred between 4 pm and 6 pm, with figures above 500,000.
  - The number of users decreased steadily in the evening and dropped to 200,000 at 11 pm.
- Genre Preference
  - genre_id 0 was the most popular genre in the top 10 ranking.
  - genre_id 0, 7, 10, 25, 27, 14, 734, 297 and 2744 were the most popular.
  - Popular genres were favoured across most sessions, with some exceptions: genre_id 2744 was not popular during the night and midnight sessions, genre_id 50 was preferred at night, and genre_id 3645 at midnight.
  - The number of users listening to genre 0 was four times higher outside Flow, whereas a greater variety of genres appeared when users were listening in Flow.
  - Genre preference differed between user age groups. genre_id 0 dominated across all age groups, and users aged 19 were its main audience.
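Genre rankings like the ones summarised above can be produced with a small helper. The following is a sketch under assumptions: popularity is measured as the number of plays per `genre_id` (the source does not state whether plays or distinct users were counted), and `top_genres` is a hypothetical function name.

```python
import pandas as pd


def top_genres(df: pd.DataFrame, n: int = 10, by: str = None):
    """Top-n genres by play count, overall or within each value of `by`."""
    if by is None:
        return df["genre_id"].value_counts().head(n)
    counts = df.groupby([by, "genre_id"]).size().rename("plays").reset_index()
    counts = counts.sort_values([by, "plays"], ascending=[True, False])
    # Keep the n most-played genres of every group.
    return counts.groupby(by).head(n).reset_index(drop=True)


# Tiny made-up sample, not taken from the real dataset.
sample = pd.DataFrame(
    {
        "genre_id": [0, 0, 0, 7, 7, 10],
        "listen_type": [0, 0, 1, 1, 1, 1],
    }
)
overall = top_genres(sample, n=2)
per_flow = top_genres(sample, n=1, by="listen_type")
print(overall)
print(per_flow)
```

The same call with `by="session"` or `by="user_age"` would give the per-session and per-age rankings, provided those columns exist in the frame.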
To sum up, we found that time is one of the most critical factors affecting the type of songs users listen to. Music preference also varied between user age groups, platforms and listening environments. To improve the Flow feature and reduce the user bounce rate, we suggest a context-based recommendation system; nevertheless, personalized features need to be considered when building such a model.
It was a great experience to work on a dataset that contains millions of entries. Ideally, it would be good to start the data processing in a database because of the simplicity of programming. In addition, running data queries gives a quick overview of the data and a better understanding when performing statistical calculations.
On the other hand, there are many categorical attributes which are replaced with numeric labels in the given dataset; it would be helpful to have the original labels of each categorical variable, which can help analysts form problem statements or hypotheses as well as provide better interpretation when analyzing data.
- Adiyansjan, Gunawan, A. A., & Suhartono, D. (2019). Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks. Procedia Computer Science 157, 99-109.
- Hansen, C., Mehrotra, R., Hansen, C., Brost, B., Maystre, L., & Lalmas, M. (2021). Shifting Consumption towards Diverse Content on Music Streaming Platforms. Proceedings of the 14th ACM International Conference on Web Search and Data Mining, March, 238-246.
- Maasø, A., & Hagen, A. N. (2020). Metrics and Decision-Making in Music Streaming. Popular Communication Vol. 18, No. 1, 18-31.