---
title: "Movies and Ratings"
author: "Nathan Diekema"
date: "9/30/2021"
output:
  prettydoc::html_pretty:
    theme: leonids
---
## Setup
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
```
```{r libraries, message=FALSE, warning=FALSE}
library(tidyverse)
library(reprex)
```
## The Data
**Movie_Ratings.csv**
Download link: https://www.dropbox.com/s/ebr2gzy95pb9lsx/Movie%20Ratings.csv?dl=1
Variables:
* Film - The title of a film.
* Genre - The film's primary genre.
* Rotten Tomatoes Ratings % - The average movie rating from critics on Rotten Tomatoes, from 0-100.
* Audience Ratings % - The average movie rating voted by audience members on Rotten Tomatoes, from 0-100.
* Budget (million $) - The operating budget of the movie.
* Year of release - The year the movie hit theaters.
**imdb_1000.csv**
Download link: https://www.dropbox.com/s/ov5cntaof9lj9v6/imdb_1000.csv?dl=1
Variables:
* star_rating - The rating of the movie from user votes on imdb.com. (0-10)
* title - The title of the film.
* content_rating - The designation of the movie from the Motion Picture Association.
* genre - The primary genre of the film.
* duration - The length of the film, in minutes.
* actors_list - The leading actors credited in the movie.
## The Tasks
#### Cleaning/Plotting
*1. Read in and summarize the data.*
```{r message=FALSE, warning=FALSE}
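imdb <- read_csv("data/imdb_top_1000.csv")

# Split the comma-separated genre list into up to three columns.
# Runtime is read as text, so parse_number() extracts its numeric part.
imdb <- imdb %>%
  separate(Genre,
           sep = ",\\s*",   # drop any whitespace after each comma
           into = c("Genre1", "Genre2", "Genre3"),
           fill = "right") %>%
  mutate(
    Runtime = parse_number(Runtime)
  )

head(imdb)

imdb %>%
  summarize(
    Num_Movies = n_distinct(Series_Title),
    # na.rm = TRUE so that empty Genre2/Genre3 slots are not counted as a genre
    Num_Genres = n_distinct(c(Genre1, Genre2, Genre3), na.rm = TRUE),
    Avg_Runtime = mean(Runtime),
    Avg_IMDB_Rating = mean(IMDB_Rating)
  )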
```
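The genre count in the summary depends on how the comma-separated genre string is split. The chunk below is a minimal sketch on made-up data: the `toy_imdb_raw` table and its values are hypothetical, and it assumes genres are written with a space after each comma. It compares splitting on a bare comma with splitting on a comma plus any following whitespace.

```{r message=FALSE}
library(dplyr)
library(tidyr)

# Hypothetical toy data; assumes genres are separated by a comma and a space
toy_imdb_raw <- tibble(
  Series_Title = c("Film A", "Film B"),
  Genre = c("Crime, Drama", "Drama")
)

# Splitting on the bare comma leaves a leading space in the second piece
split_plain <- toy_imdb_raw %>%
  separate(Genre, into = c("Genre1", "Genre2"), sep = ",", fill = "right")

# Splitting on the comma plus any following whitespace avoids that
split_trimmed <- toy_imdb_raw %>%
  separate(Genre, into = c("Genre1", "Genre2"), sep = ",\\s*", fill = "right")

genres_plain <- n_distinct(c(split_plain$Genre1, split_plain$Genre2), na.rm = TRUE)
genres_trimmed <- n_distinct(c(split_trimmed$Genre1, split_trimmed$Genre2), na.rm = TRUE)

c(plain = genres_plain, trimmed = genres_trimmed)
```

In this toy case the plain split treats "Drama" and " Drama" as two different genres, so it reports one genre more than the trimmed split.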
*2. What genre had the highest average imdb rating?*
```{r}
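# One row per movie-genre pair
imdb_long <-
  imdb %>%
  pivot_longer(
    c(Genre1, Genre2, Genre3),
    names_to = "Genre Number",
    values_to = "Genre"
  ) %>%
  mutate(Genre = str_trim(Genre)) %>%   # guard against stray whitespace from the split
  drop_na(IMDB_Rating, Genre)           # movies with fewer than three genres leave NA slots

imdb_ratings <-
  imdb_long %>%
  group_by(Genre) %>%
  summarize(
    mean_rating = mean(IMDB_Rating)
  )

imdb_ratings %>%
  arrange(desc(mean_rating))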
```
As you can see from the table above, **Western** movies had the highest average IMDB rating.
*3. Is there a relationship between the content rating of the movie (e.g. "PG-13") and its duration? Make a plot.*
```{r fig.align='center'}
imdb %>%
  drop_na(Certificate) %>%
  ggplot(aes(x = Certificate, y = Runtime)) +
  # stat = "summary" with fun = "mean" draws the mean runtime per certificate
  geom_bar(stat = "summary", fun = "mean", color = "black", fill = "lightsalmon1")
```
**Conclusion:**
The chart above suggests that there is no strong relationship between a movie's content rating and its run time. The most notable pattern is the higher average run time of unrated movies.
#### Pivoting
*1. Make a column plot comparing Rotten Tomato critic and audience ratings for all the Romance movies.*
```{r message=FALSE, warning=FALSE, fig.align='center'}
ratings <- read_csv("data/Movie_Ratings.csv")
head(ratings)

# Stack the critic and audience scores into a single rating column
ratings_long <-
  ratings %>%
  drop_na(c("Rotten Tomatoes Ratings %", "Audience Ratings %")) %>%
  pivot_longer(
    c("Rotten Tomatoes Ratings %", "Audience Ratings %"),
    names_to = "rating_type",
    values_to = "rating"
  )

ratings_long %>%
  filter(Genre == "Romance") %>%
  ggplot(aes(x = rating_type, y = rating)) +
  geom_bar(stat = "summary", fun = "mean", color = "black", fill = "aquamarine3")
```
*2. For each year, find the average audience rating difference between Comedy and Drama movies.*
```{r, fig.align='center', fig.width=8}
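# %in% keeps every Comedy or Drama row; == would recycle the vector and
# silently drop rows. Keep only audience scores, as the question asks.
ratings_comedy_drama <- ratings_long %>%
  filter(Genre %in% c("Comedy", "Drama"),
         rating_type == "Audience Ratings %")

ratings_comedy_drama %>%
  ggplot(aes(x = as.factor(`Year of release`), y = rating, fill = Genre)) +
  geom_boxplot() +
  xlab("Year of release") +
  ylab("Audience rating (%)")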
```
The average audience ratings of Comedy and Drama movies differed the most in 2007 and the least in 2009.
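A box plot shows the gap between genres only visually, while the question asks for a per-year difference. One way to compute it as a table is sketched below on made-up data. The `toy_ratings` table and its values are hypothetical, although the column names follow `Movie_Ratings.csv`, and the sign convention (Drama minus Comedy) is an arbitrary choice.

```{r message=FALSE}
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the real ratings table
toy_ratings <- tibble(
  Film = c("A", "B", "C", "D", "E", "F"),
  Genre = c("Comedy", "Drama", "Comedy", "Drama", "Comedy", "Action"),
  `Audience Ratings %` = c(60, 80, 70, 75, 50, 90),
  `Year of release` = c(2007, 2007, 2008, 2008, 2007, 2008)
)

genre_gap <- toy_ratings %>%
  filter(Genre %in% c("Comedy", "Drama")) %>%
  group_by(`Year of release`, Genre) %>%
  summarize(mean_audience = mean(`Audience Ratings %`), .groups = "drop") %>%
  # One column per genre so the two means can be subtracted row by row
  pivot_wider(names_from = Genre, values_from = mean_audience) %>%
  mutate(difference = Drama - Comedy)

genre_gap
```

Sorting the result by `abs(difference)` would then identify the years with the largest and smallest gaps directly.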
#### Joining
*1. How many movies appear in both datasets?*
```{r message=FALSE, warning=FALSE}
count(
  semi_join(x = ratings, y = imdb, by = c("Film" = "Series_Title"))
)
```
The datasets given have **41** movies in common.
*2. How many movies appear in only the imdb dataset?*
```{r message=FALSE, warning=FALSE}
count(
  anti_join(x = imdb, y = ratings, by = c("Series_Title" = "Film"))
)
```
The IMDB dataset contains **959** unique movies that are not in the Rotten Tomatoes dataset.
*3. How many movies appear in only the Rotten Tomatoes dataset?*
```{r message=FALSE, warning=FALSE}
count(
  anti_join(x = ratings, y = imdb, by = c("Film" = "Series_Title"))
)
```
The Rotten Tomatoes dataset contains **521** unique movies that are not in the imdb dataset.
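The three counts above come from filtering joins: `semi_join()` keeps the rows of `x` that have a match in `y`, and `anti_join()` keeps the rows of `x` that have none. The chunk below is a small sketch on made-up tables (`toy_tomato` and `toy_imdb` are hypothetical) and assumes titles are unique within each table. It shows how the shared and unique counts add back up to each table's size.

```{r message=FALSE}
library(dplyr)

# Hypothetical toy tables with unique titles
toy_tomato <- tibble(Film = c("Alpha", "Beta", "Gamma"))
toy_imdb <- tibble(Series_Title = c("Beta", "Gamma", "Delta", "Epsilon"))

in_both <- semi_join(toy_tomato, toy_imdb, by = c("Film" = "Series_Title"))
only_imdb <- anti_join(toy_imdb, toy_tomato, by = c("Series_Title" = "Film"))
only_tomato <- anti_join(toy_tomato, toy_imdb, by = c("Film" = "Series_Title"))

# Sanity check: shared rows plus unique rows should equal each table's size
nrow(in_both) + nrow(only_tomato) == nrow(toy_tomato)
nrow(in_both) + nrow(only_imdb) == nrow(toy_imdb)
```

If a title is repeated in either table, these sums no longer have to match, so the check also serves as a quick test for duplicates.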
#### Joining and Pivoting
*Make a plot comparing the ratings from Rotten Tomatoes Critics, Rotten Tomatoes Audience, and imdb.*
```{r fig.width=8, fig.align='center', message=FALSE, warning=FALSE}
# Clean data sets
imdb_clean <-
  imdb %>%
  drop_na(IMDB_Rating) %>%
  mutate(
    IMDB_Rating = IMDB_Rating * 10   # rescale 0-10 to 0-100
  )

tomato_clean <-
  ratings %>%
  drop_na(c("Rotten Tomatoes Ratings %", "Audience Ratings %"))

# Keep movies from either source, then stack the three scores
all_ratings <-
  full_join(x = imdb_clean, y = tomato_clean, by = c("Series_Title" = "Film")) %>%
  pivot_longer(
    c("Rotten Tomatoes Ratings %", "Audience Ratings %", "IMDB_Rating"),
    names_to = "rating_type",
    values_to = "rating"
  ) %>%
  drop_na(rating)   # a movie found in only one source has no score from the other

all_ratings %>%
  ggplot(aes(x = rating_type, y = rating, fill = rating_type)) +
  geom_boxplot() +
  ylab("Rating") +
  xlab("Rating Type") +
  labs(fill = "Rating Type")
```
The box plot above compares the distributions of the IMDB ratings with the Rotten Tomatoes audience and critic ratings. I multiplied the IMDB ratings by 10 so that they are on the same 0-100 scale as the Rotten Tomatoes percentages. The IMDB ratings vary the least and have the highest median rating. The Rotten Tomatoes ratings are less forgiving and vary considerably more. Within Rotten Tomatoes, the audience ratings are generally higher than the critic ratings and vary less.
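Statements about which rating type is highest or varies least can be backed with numbers as well as read off the plot. The chunk below is a minimal sketch on made-up values: `toy_all_ratings` is hypothetical and only mimics the long format of `all_ratings`. It computes the median and interquartile range per rating type, the same quantities a box plot draws.

```{r message=FALSE}
library(dplyr)

# Hypothetical toy data in the same long format as all_ratings
toy_all_ratings <- tibble(
  rating_type = rep(
    c("IMDB_Rating", "Audience Ratings %", "Rotten Tomatoes Ratings %"),
    each = 4
  ),
  rating = c(78, 80, 82, 84, 40, 60, 70, 90, 10, 40, 70, 95)
)

rating_summary <- toy_all_ratings %>%
  group_by(rating_type) %>%
  summarize(
    median_rating = median(rating, na.rm = TRUE),  # centre line of each box
    iqr_rating = IQR(rating, na.rm = TRUE)         # height of each box
  )

rating_summary
```

Applying the same `group_by()` and `summarize()` steps to `all_ratings` would give the corresponding figures for the real data.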