1- Introduction
2- Import data
3- Tokenization and stemming
4- TfidfVectorizer
5- Create clusters
6- Calculate similarity distance
7- Dendrogram
8- Example
9- Conclusion
The purpose of this project is to find the degree of similarity between movies based on their plot descriptions available on IMDb and Wikipedia.
We import the data from a CSV file into a DataFrame.
The text in the two columns describes the same plots, but it is written in different tones and styles of expression, so we will combine both columns into one.
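As a minimal sketch of these two steps (the file contents and the column names `wiki_plot` and `imdb_plot` are assumptions for illustration), loading the CSV and concatenating the two plot columns might look like:

```python
from io import StringIO

import pandas as pd

# Hypothetical inline sample standing in for the real CSV file
csv_data = StringIO(
    "title,wiki_plot,imdb_plot\n"
    "The Godfather,The aging patriarch of a crime family transfers control"
    " to his son,In 1945 the Corleone family celebrates a wedding\n"
)
movies_df = pd.read_csv(csv_data)

# Combine the two plot columns into a single text field
movies_df["plot"] = (
    movies_df["wiki_plot"].astype(str) + " " + movies_df["imdb_plot"].astype(str)
)
print(movies_df["plot"].iloc[0])
```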
Tokenization is the process by which we break articles down into individual sentences or words.
We will also remove tokens that are purely numeric values or punctuation.
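A self-contained sketch of this step; the project would typically use NLTK's tokenizers, but a simple regex stand-in illustrates the idea of splitting text into tokens and keeping only the alphabetic ones:

```python
import re


def tokenize(text):
    """Split text into word tokens, then drop tokens that are
    purely numeric or punctuation (only alphabetic tokens kept).
    A simplified stand-in for NLTK's word_tokenize."""
    return [tok for tok in re.findall(r"\w+", text) if tok.isalpha()]


tokens = tokenize("In 1972, Don Vito Corleone hosts his daughter's wedding.")
print(tokens)
```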
Stemming is the process by which we reduce a word's different forms to a common root word.
For example, the words 'fishing', 'fished', and 'fisher' all get stemmed to the word 'fish'.
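Assuming NLTK is available, a sketch using its Snowball stemmer (a pure algorithm, so it needs no corpus downloads):

```python
from nltk.stem.snowball import SnowballStemmer

# Snowball ("Porter2") stemmer for English
stemmer = SnowballStemmer("english")

words = ["fishing", "fished", "fisher"]
stems = [stemmer.stem(w) for w in words]
print(stems)
```

Note that exactly which forms collapse to the same root depends on the stemmer chosen; different stemmers handle suffixes like "-er" differently.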
TF-IDF recognizes words that are unique and important to a document.
TF (Term Frequency): how often a term appears in a document.
IDF (Inverse Document Frequency): reduces the weight of a word if it appears frequently across documents.
Once we create a TF-IDF vectorizer, we fit it to the text and then transform the text to produce the corresponding numeric form of the data.
To determine how closely one movie is related to another, we can use clustering techniques.
Clustering is the method of grouping together a number of items that share similar properties.
We will cluster the movies in our dataset by genre.
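A sketch of clustering the TF-IDF vectors with k-means (scikit-learn's `KMeans`; the toy plots and the choice of 2 clusters are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

plots = [
    "a mafia family fights to keep its power",
    "the mafia boss protects his family",
    "a prisoner plans a daring escape",
    "two inmates bond over years in prison",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(plots)

# Group the movies into a chosen number of clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(tfidf)
print(labels)  # cluster label per movie
```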
We define the similarity distance as 1 - cosine similarity, where the cosine similarity is the cosine of the angle between two movies' TF-IDF vectors.
If the movies' plots are identical, the cosine of their angle is 1, and the distance between them is 1 - 1 = 0.
The more similar the movies are, the closer the distance is to 0.
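This distance can be computed from the TF-IDF matrix with scikit-learn's `cosine_similarity`; a sketch with toy plots, where the two identical plots end up at distance 0:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

plots = [
    "the mafia boss protects his family",
    "the mafia boss protects his family",  # identical to the first plot
    "a prisoner plans a daring escape",
]

tfidf = TfidfVectorizer().fit_transform(plots)

# distance = 1 - cosine similarity: identical plots -> distance 0
distance = 1 - cosine_similarity(tfidf)
print(distance.round(3))
```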
We have quantified the similarity of movies based on their plot summaries available on IMDb and Wikipedia, and we separated them into clusters.
Finally, we created a dendrogram to represent how closely the movies are related to each other.
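The hierarchical clustering behind the dendrogram can be sketched with SciPy (the titles and plots here are toy stand-ins; `no_plot=True` builds the tree structure without drawing it):

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["The Godfather", "Goodfellas", "The Shawshank Redemption"]
plots = [
    "a mafia family fights to keep its power",
    "the rise of a mob associate inside the mafia",
    "a prisoner plans a daring escape",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(plots)
dist = 1 - cosine_similarity(tfidf)  # similarity distance matrix

# Condense the square distance matrix, then apply complete-linkage clustering
mergings = linkage(squareform(dist, checks=False), method="complete")

# Build the dendrogram structure; no_plot=True skips matplotlib drawing
tree = dendrogram(mergings, labels=titles, no_plot=True)
print(tree["ivl"])  # leaf labels in dendrogram order
```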