najlahamza/moviesimilarity

moviesimilarity

Table of contents

1- Introduction
2- Import data
3- Tokenization and stemming
4- TfidfVectorizer
5- Create clusters
6- Calculate similarity distance
7- Dendrogram
8- Example
9- Conclusion

1. Introduction


The purpose of this project is to measure the degree of similarity between movies based on their plot descriptions available on IMDb and Wikipedia.

2. Import data


We import the data from a CSV file into a dataframe.

Combine Wikipedia and IMDb plots


The text in the two columns is similar, but the two sources are written in different tones and styles,
so we will combine both columns into a single plot column.
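A minimal sketch of this step with pandas. The column names `title`, `wiki_plot`, and `imdb_plot` and the sample rows are assumptions for illustration, and an in-memory string stands in for the real CSV file:

```python
import io

import pandas as pd

# Stand-in for the real CSV file; column names and rows are hypothetical.
csv_text = io.StringIO(
    "title,wiki_plot,imdb_plot\n"
    'Movie A,"A crew explores a distant planet.","Astronauts land on an alien world."\n'
    'Movie B,"A wedding goes wrong.","A family gathers for a chaotic wedding."\n'
)

movies_df = pd.read_csv(csv_text)

# Concatenate both descriptions; fillna("") keeps a missing plot from
# turning the combined text into NaN.
movies_df["plot"] = (
    movies_df["wiki_plot"].fillna("") + "\n" + movies_df["imdb_plot"].fillna("")
)

print(movies_df[["title", "plot"]])
```

With a real file, `pd.read_csv` would take the file path instead of the `io.StringIO` object.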

3. Tokenization and stemming

Tokenization


Tokenization is the process of breaking a text down into individual sentences or words.
We also remove tokens that are purely numeric or punctuation.
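The idea can be sketched with the standard library alone. This is a simplified regex tokenizer, not necessarily the one used in the project (a library tokenizer such as NLTK's would handle abbreviations and edge cases better); the function name `tokenize` is hypothetical:

```python
import re


def tokenize(text):
    """Split text into sentences, then words, dropping numbers and punctuation."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokens = []
    for sentence in sentences:
        # Words (with an optional apostrophe part) or single punctuation marks.
        for token in re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence):
            # Keep only tokens containing at least one ASCII letter.
            if re.search(r"[a-zA-Z]", token):
                tokens.append(token.lower())
    return tokens


tokens = tokenize("In 1977, two droids escape! They land on a desert planet.")
print(tokens)
```

The token `1977` and the punctuation marks are filtered out because they contain no letters.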

Stemming


Stemming is the process of reducing the different forms of a word to a common root.
For example, a stemmer may reduce the words 'fishing', 'fished', and 'fisher' to the root 'fish'; the exact result depends on the stemming algorithm used.
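A short sketch using NLTK's Snowball stemmer. The choice of stemmer is an assumption, since the text above does not say which one the project uses, and different stemmers can return different roots for the same word:

```python
from nltk.stem.snowball import SnowballStemmer

# English Snowball stemmer; it needs no extra NLTK data downloads.
stemmer = SnowballStemmer("english")

words = ["fishing", "fished", "fisher"]
stems = [stemmer.stem(word) for word in words]

# Print each word next to its stem to inspect what this stemmer does.
for word, stem in zip(words, stems):
    print(word, "->", stem)
```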

4. TfidfVectorizer

Create TfidfVectorizer


TF-IDF highlights words that are distinctive and important for a given document.

TF (Term Frequency): how often a term appears in a document.
IDF (Inverse Document Frequency): reduces the importance of a term if it appears in many documents.
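To make the two factors concrete, here is a small worked computation using the textbook definitions, tf = count / document length and idf = log(N / document frequency). The toy corpus and function names are assumptions, and scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant, so its numbers will differ:

```python
import math
from collections import Counter

# Toy corpus of already-tokenized plots (hypothetical data).
docs = [
    ["space", "crew", "alien"],
    ["space", "station", "crew"],
    ["wedding", "family", "comedy"],
]


def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """Log of (number of documents / documents containing `term`).

    Assumes `term` occurs in at least one document.
    """
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / doc_freq)


def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# 'alien' occurs in one document, 'space' in two, so 'alien' scores higher.
print(tfidf("alien", docs[0], docs))
print(tfidf("space", docs[0], docs))
```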

Fit transform TfidfVectorizer


Once we have created a TF-IDF vectorizer, we fit it to the text and then transform the text into its numeric form: a matrix with one row per movie and one column per term.
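A minimal sketch of the fit and transform step with scikit-learn. The sample plots and the `stop_words="english"` setting are assumptions, not the project's actual configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical plot summaries standing in for the combined plot column.
plots = [
    "A space crew fights an alien invasion.",
    "An astronaut crew explores an alien planet.",
    "A family wedding turns into comedy chaos.",
]

# stop_words="english" drops common words such as 'a' and 'an'.
vectorizer = TfidfVectorizer(stop_words="english")

# fit_transform learns the vocabulary and returns a sparse matrix
# with one row per plot and one column per vocabulary term.
tfidf_matrix = vectorizer.fit_transform(plots)

print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

`fit_transform` is equivalent to calling `fit` and then `transform` on the same text.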

5. Create clusters

To determine how closely one movie is related to another, we can use clustering techniques.
Clustering is the method of grouping together items that have similar properties.
We will cluster our dataset by the genre of the movies.
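The text above does not name a clustering algorithm, so the following sketch assumes k-means on the TF-IDF vectors; the sample plots and the number of clusters are also assumptions. The clusters found this way group movies with similar vocabulary, which may or may not line up with labelled genres:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical plots: two science-fiction stories and two wedding comedies.
plots = [
    "space crew fights alien invasion",
    "astronaut crew explores alien planet",
    "family wedding turns into comedy chaos",
    "wedding planner upsets the whole family",
]

tfidf_matrix = TfidfVectorizer().fit_transform(plots)

# n_clusters is a guess that must be tuned for a real dataset;
# random_state makes the run reproducible.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(tfidf_matrix)

for plot, label in zip(plots, labels):
    print(label, plot)
```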

6. Calculate similarity distance

The similarity distance is defined as:

1 - cosine similarity


If two movies' plots are represented by identical vectors, the cosine of the angle between them is 1, so the distance between them is 1 - 1 = 0.
The more similar two movies are, the closer their distance is to 0.
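The definition above can be sketched with scikit-learn's `cosine_similarity`, applied to a TF-IDF matrix built from hypothetical sample plots:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical plots: the first two share words, the third shares none.
plots = [
    "space crew fights alien invasion",
    "astronaut crew explores alien planet",
    "family wedding turns into comedy chaos",
]

tfidf_matrix = TfidfVectorizer().fit_transform(plots)

# Pairwise distance matrix: entry [i, j] is 1 - cos(angle between plot i and plot j).
similarity_distance = 1 - cosine_similarity(tfidf_matrix)

print(np.round(similarity_distance, 3))
```

Each plot is at distance 0 from itself, and plots with no words in common are at distance 1.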

7. Dendrogram


We will plot a dendrogram of the movies, using the similarity distance as the measure of how far apart they are.
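A sketch of the plotting step with SciPy and matplotlib. The titles, plots, and the `complete` linkage method are assumptions. The square distance matrix is converted to condensed form with `squareform`, because `linkage` would otherwise treat each row as a raw observation vector:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove to open a window
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and plots.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
plots = [
    "space crew fights alien invasion",
    "astronaut crew explores alien planet",
    "family wedding turns into comedy chaos",
    "wedding planner upsets the whole family",
]

tfidf_matrix = TfidfVectorizer().fit_transform(plots)

# Clip tiny negative values caused by rounding and force an exact zero diagonal.
distance = np.clip(1 - cosine_similarity(tfidf_matrix), 0, None)
np.fill_diagonal(distance, 0)

# Condensed distances -> hierarchical merges -> dendrogram.
merges = linkage(squareform(distance, checks=False), method="complete")

fig, ax = plt.subplots(figsize=(6, 3))
result = dendrogram(merges, labels=titles, orientation="left", ax=ax)
fig.tight_layout()
# fig.savefig("dendrogram.png")  # uncomment to write the figure to disk
```

Movies that merge at a lower height in the dendrogram are closer in similarity distance.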

8. Conclusion


We quantified the similarity of movies based on their plot summaries available on IMDb and Wikipedia
and separated them into clusters.
We then created a dendrogram to show how closely the movies are related to each other.
