Instead of embedding words, we will embed movies. In particular, if we can embed movies well, then similar movies will be close to each other in the embedding space and can be recommended. This line of reasoning is analogous to the distributional hypothesis of word meanings. For words, this roughly means that words appearing in similar sentences should have similar vector representations. For movies, the vectors for two movies should be similar if the movies are watched by similar people.
Let the total number of movies be $M$.
- Compute the data $X_{i,j}$ from the MovieLens (small) dataset per the description. Briefly describe your data-prep workflow (you can use `pandas` if needed); a sketch follows this list.
- Optimize the function $c(v_1, \ldots, v_M)$ over $v_1, \ldots, v_M$ using gradient descent (using `pytorch` or `tensorflow`). Plot the loss as a function of iteration for various choices (learning rates, choice of optimizer, etc.); see the second sketch below.
- Recommend the top 10 movies (not vectors or indices, but movie names) given the movies (a) Apollo 13, (b) Toy Story, and (c) Home Alone. Describe your recommendation strategy (one possibility is sketched below). Do the recommendations change when you change learning rates or optimizers? Why or why not?
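For the data-prep step, here is a minimal `pandas`/`numpy` sketch. It assumes $X_{i,j}$ counts the users who rated both movie $i$ and movie $j$ (substitute the exact definition from the description), and the file paths are assumptions about where the MovieLens (small) CSVs live.

```python
import numpy as np
import pandas as pd

# Paths are assumptions -- point them at your MovieLens (small) download.
ratings = pd.read_csv("ml-latest-small/ratings.csv")  # userId, movieId, rating, timestamp
movies = pd.read_csv("ml-latest-small/movies.csv")    # movieId, title, genres

# Map raw movieIds/userIds to contiguous indices.
movie_ids = np.sort(ratings["movieId"].unique())
idx = {m: i for i, m in enumerate(movie_ids)}
M = len(movie_ids)
user_ids = np.sort(ratings["userId"].unique())
uidx = {u: j for j, u in enumerate(user_ids)}

# Binary incidence matrix: A[u, i] = 1 iff user u rated movie i.
A = np.zeros((len(user_ids), M), dtype=np.float32)
A[ratings["userId"].map(uidx).to_numpy(), ratings["movieId"].map(idx).to_numpy()] = 1.0

# Assumed co-watch counts: X[i, j] = number of users who rated both i and j.
X = A.T @ A
```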
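For the optimization step, a `pytorch` sketch follows. The cost below is only a hypothetical stand-in (fitting inner products $v_i^\top v_j$ to $\log X_{i,j}$ on co-watched pairs); replace it with the $c(v_1, \ldots, v_M)$ defined in the problem. The embedding dimension `d` is an arbitrary assumption.

```python
import torch

d = 32                                     # embedding dimension (arbitrary assumption)
Xt = torch.tensor(X)                       # X and M come from the data-prep sketch
mask = Xt > 0                              # only fit pairs that actually co-occur
target = torch.log(Xt[mask])

V = torch.randn(M, d, requires_grad=True)  # v_1, ..., v_M as rows of V
opt = torch.optim.Adam([V], lr=1e-2)       # try SGD and other learning rates too

losses = []
for step in range(2000):
    opt.zero_grad()
    # Placeholder cost, NOT necessarily the assignment's c: squared error
    # between inner products and log co-watch counts.
    loss = torch.mean(((V @ V.T)[mask] - target) ** 2)
    loss.backward()
    opt.step()
    losses.append(loss.item())

import matplotlib.pyplot as plt
plt.plot(losses)
plt.xlabel("iteration")
plt.ylabel("loss")
plt.show()
```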
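One plausible recommendation strategy (an assumption, not the only option) is nearest neighbors under cosine similarity in the learned embedding space. The `recommend` helper below is hypothetical and builds on the objects defined in the earlier sketches.

```python
import torch.nn.functional as F

titles = movies.set_index("movieId")["title"]

def recommend(query, k=10):
    # Hypothetical helper: take the first title containing the query string
    # (e.g. "Apollo 13" matches "Apollo 13 (1995)").
    mid = titles[titles.str.contains(query, regex=False)].index[0]
    q = V[idx[mid]].detach()
    sims = F.cosine_similarity(V.detach(), q.unsqueeze(0))
    sims[idx[mid]] = -float("inf")         # don't recommend the query movie itself
    top = torch.topk(sims, k).indices.tolist()
    return [titles[movie_ids[i]] for i in top]

for name in ["Apollo 13", "Toy Story", "Home Alone"]:
    print(name, "->", recommend(name))
```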