1- Introduction
2- Import data
3- Tokenization and stemming
4- TfidfVectorizer
5- Create clusters
6- Calculate similarity distance
7- Dendrogram
8- Example
9- Conclusion
The purpose of this project is to find the degree of similarity between movies based on their plot descriptions available on IMDb and Wikipedia.
We import the data from a CSV file into a DataFrame.
The text in the two columns describes the same plots, but it is written in different tones and styles of expression, so we will combine both columns into one.
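As a minimal sketch of these two steps (the file contents and the column names `wiki_plot` and `imdb_plot` are assumptions for illustration), loading the CSV and concatenating the two plot columns might look like:

```python
from io import StringIO

import pandas as pd

# Hypothetical inline sample standing in for the real CSV file
csv_data = StringIO(
    "title,wiki_plot,imdb_plot\n"
    "The Godfather,The aging patriarch of a crime family transfers control"
    " to his son,In 1945 the Corleone family celebrates a wedding\n"
)
movies_df = pd.read_csv(csv_data)

# Combine the two plot columns into a single text field
movies_df["plot"] = (
    movies_df["wiki_plot"].astype(str) + " " + movies_df["imdb_plot"].astype(str)
)
print(movies_df["plot"].iloc[0])
```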
Tokenization is the process by which we break articles down into individual sentences or words.
We will also remove tokens that are purely numeric values or punctuation.
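A self-contained sketch of this step; the project would typically use NLTK's tokenizers, but a simple regex stand-in illustrates the idea of splitting text into tokens and keeping only the alphabetic ones:

```python
import re


def tokenize(text):
    """Split text into word tokens, then drop tokens that are
    purely numeric or punctuation (only alphabetic tokens kept).
    A simplified stand-in for NLTK's word_tokenize."""
    return [tok for tok in re.findall(r"\w+", text) if tok.isalpha()]


tokens = tokenize("In 1972, Don Vito Corleone hosts his daughter's wedding.")
print(tokens)
```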
Stemming is the process by which we reduce a word's different forms to a common root word.
For example, the words 'fishing', 'fished', and 'fisher' all get stemmed to the word 'fish'.
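Assuming NLTK is available, a sketch using its Snowball stemmer (a pure algorithm, so it needs no corpus downloads):

```python
from nltk.stem.snowball import SnowballStemmer

# Snowball ("Porter2") stemmer for English
stemmer = SnowballStemmer("english")

words = ["fishing", "fished", "fisher"]
stems = [stemmer.stem(w) for w in words]
print(stems)
```

Note that exactly which forms collapse to the same root depends on the stemmer chosen; different stemmers handle suffixes like "-er" differently.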
TF-IDF recognizes words that are unique and important to a document.
TF (Term Frequency): how often a term appears in a document.
IDF (Inverse Document Frequency): reduces the weight of a word if it appears frequently across documents.
Once we create a TF-IDF vectorizer, we fit it to the text and then transform the text to produce the corresponding numeric form of the data.
To determine how closely one movie is related to another, we can use clustering techniques.
Clustering is the method of grouping together a number of items that share similar properties.
We will cluster the movies in our dataset by genre.
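A sketch of clustering the TF-IDF vectors with k-means (scikit-learn's `KMeans`; the toy plots and the choice of 2 clusters are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

plots = [
    "a mafia family fights to keep its power",
    "the mafia boss protects his family",
    "a prisoner plans a daring escape",
    "two inmates bond over years in prison",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(plots)

# Group the movies into a chosen number of clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(tfidf)
print(labels)  # cluster label per movie
```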
We define the similarity distance as 1 - cosine similarity, where the cosine similarity is the cosine of the angle between two movies' TF-IDF vectors.
If the movies' plots are identical, the cosine of their angle is 1, and the distance between them is 1 - 1 = 0.
The more similar the movies are, the closer the distance is to 0.
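This distance can be computed from the TF-IDF matrix with scikit-learn's `cosine_similarity`; a sketch with toy plots, where the two identical plots end up at distance 0:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

plots = [
    "the mafia boss protects his family",
    "the mafia boss protects his family",  # identical to the first plot
    "a prisoner plans a daring escape",
]

tfidf = TfidfVectorizer().fit_transform(plots)

# distance = 1 - cosine similarity: identical plots -> distance 0
distance = 1 - cosine_similarity(tfidf)
print(distance.round(3))
```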
We have quantified the similarity of movies based on their plot summaries available on IMDb and Wikipedia, and we separated them into clusters.
Finally, we created a dendrogram to represent how closely the movies are related to each other.
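The hierarchical clustering behind the dendrogram can be sketched with SciPy (the titles and plots here are toy stand-ins; `no_plot=True` builds the tree structure without drawing it):

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["The Godfather", "Goodfellas", "The Shawshank Redemption"]
plots = [
    "a mafia family fights to keep its power",
    "the rise of a mob associate inside the mafia",
    "a prisoner plans a daring escape",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(plots)
dist = 1 - cosine_similarity(tfidf)  # similarity distance matrix

# Condense the square distance matrix, then apply complete-linkage clustering
mergings = linkage(squareform(dist, checks=False), method="complete")

# Build the dendrogram structure; no_plot=True skips matplotlib drawing
tree = dendrogram(mergings, labels=titles, no_plot=True)
print(tree["ivl"])  # leaf labels in dendrogram order
```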