-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path05_002_spotify-raw-data-preparation.Rmd
155 lines (125 loc) · 5.13 KB
/
05_002_spotify-raw-data-preparation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
---
title: "Data Preparation"
description: |
Preparation of the raw data downloaded from Spotify.
author:
- first_name: "Joshua"
last_name: "Cook"
url: https://joshuacook.netlify.app
orcid_id: 0000-0001-9815-6879
date: "December 7, 2020"
output:
distill::distill_article:
toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, dpi = 500, comment = "#>")
library(mustashe)
library(jhcutils)
library(nakedpipe)
library(kableExtra)
library(tidyverse)
theme_set(theme_minimal())
raw_data_dir <- here::here("_raw-data", "MyData")
data_dir <- here::here("data")
```
Here, I read in and organized the raw data provided by Spotify.
The raw data is provided as JSON, so I converted into tidy data frames.
I have not provided the raw data used here because there is some potentially private/financial information, but the results of the processing below are public.
More information about the data can be found in Spotify's [*Understanding my Data*](https://support.spotify.com/uk/article/understanding-my-data/) page.
## Streaming History
Spotify's description:
> A list of items (e.g. songs, videos, and podcasts) listened to or watched in the past year, including:
>
> 1. Date and time of when the stream ended in UTC format (Coordinated Universal Time zone).
> 2. Name of "creator" for each stream (e.g. the artist name if a music track).
> 3. Name of items listened to or watched (e.g. title of music track or name of video).
> 4. “msPlayed”- Stands for how many mili-seconds the track was listened.
Below is the code to prepare the streaming history data that was contained in two JSON files.
```{r, echo=TRUE}
parse_streaming_history <- function(f) {
rjson::fromJSON(file = f) %.% {
map(as_tibble)
bind_rows()
janitor::clean_names()
mutate(
end_time = lubridate::ymd_hm(end_time),
sec_played = ms_played / 1e3
)
select(-ms_played)
}
}
streaming_history <- map_chr(
seq(0, 1),
~ glue::glue("StreamingHistory{.x}.json")
) %>%
map(~ file.path(raw_data_dir, .x)) %>%
map(parse_streaming_history) %>%
bind_rows()
saveRDS(streaming_history, file.path(data_dir, "streaming_history.rds"))
```
There are four columns in this data frame:
- `end_time`: when the song finished playing
- `artist_name`: the artist
- `track_name`: name of the song
- `sec_played`: duration of the song (in seconds)
```{r}
display_head <- function(df) {
df %>%
head() %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
}
display_head(streaming_history)
```
## Inferences
Spotify's description:
> We draw certain inferences about your interests and preferences based on your usage of the Spotify service and using data obtained from our advertisers and other advertising partners.
> This includes a list of market segments with which you are currently associated.
> Depending on your settings, this data may be used to serve interest-based advertising to you within the Spotify service.
I am most excited to dig into this "Inferences" data set.
From a skim, some interesting notes are:
- `3P_Politics - Any Republican_US`, `3P_Politics - Registered Republican_US`,`3P_Politics - Any Democrat_US`, `3P_Politics - Registered Democrat_US`
- `3P_Custom__CNN_US`, `3P_Custom__Conservative Affinity TV News_US`, `3P_Custom__DailyShow_16Oct2020_US`
- `3p_Anime_Manga_Enthusiast_Es`
- `3P_Women's Apparel_CA`
- `3P_Alcohol Consumers_UK` (and some others)
- `3P__Custom_Cigarette Buyers_12Dec2019_US [Do Not Use in 2021]`
- `3P_Custom_Netflix_US`
- `3P_Custom_Parents of Boys 6-11_US`, `3P_Custom_Parents of Kids 5-13_28Apr2020_US`, `3P_Custom_Parents of Toddlers_30Nov2020_US`, etc.
This JSON file was just a single list, so I turned it into a one-column data frame.
```{r, echo=TRUE}
inferences <- rjson::fromJSON(file = file.path(raw_data_dir, "Inferences.json"))
inferences <- tibble(inference = inferences$inference)
```
```{r}
saveRDS(inferences, file.path(data_dir, "inferences.rds"))
display_head(inferences)
```
## Search Queries
Spotify's description:
> A list of searches made, including:
> 1. The date and time the search was made.
> 2. Type of device/platform used (such as iOS, desktop).
> 3. Search Query shows what the user typed in the search field.
> 4. Search interaction URIs shows the list of Uniform Resource Identifiers (URI) of the search results the user interacted with.
The only quirk of this data was that there could be multiple *search interaction URIs* for a single query.
I decided to keep each query as a single row and have the URIs kept as a list data type in the column.
For later analysis, it might be best to assign a search query index and unnest this column.
```{r, echo=TRUE}
search_queries <- rjson::fromJSON(file = file.path(raw_data_dir, "SearchQueries.json")) %>%
head() %>%
map(function(x) {
df <- as_tibble(x)
df$searchInteractionURIs <- list(df$searchInteractionURIs)
return(df)
}) %>%
bind_rows() %>%
janitor::clean_names() %>%
rename(search_interaction_uris = search_interaction_ur_is) %>%
mutate(search_time = lubridate::ymd_hms(search_time))
```
```{r}
saveRDS(search_queries, file.path(data_dir, "search_queries.rds"))
display_head(search_queries)
```