moviesimilarity

1. Introduction

The purpose of this project is to measure the degree of similarity between movies based on their plot descriptions available on IMDb and Wikipedia.

2. Import data

We import the data from a CSV file into a dataframe.

Combine Wikipedia and IMDb plots

The text in the two columns is similar, but each is written in a different tone and style,
so we will combine both columns into one.
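
As a minimal sketch of the combining step, the snippet below reads a small in-memory CSV and joins the two plot columns into one field. It uses Python's standard `csv` module rather than a dataframe to stay dependency-free, and the column names `wiki_plot` and `imdb_plot`, the sample rows, and the `load_and_combine` helper are all assumptions for illustration, not taken from the dataset.

```python
import csv
import io

# Hypothetical two-movie sample; the real CSV file and its column names may differ.
SAMPLE_CSV = """title,wiki_plot,imdb_plot
Movie A,A detective hunts a thief.,A cop chases a burglar across the city.
Movie B,Two friends open a bakery.,Best friends start a small bakery.
"""

def load_and_combine(csv_text):
    """Read movie rows and join the two plot columns into one 'plot' field."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        # A newline keeps the two summaries distinct while forming a single text.
        row["plot"] = row["wiki_plot"] + "\n" + row["imdb_plot"]
    return rows

movies = load_and_combine(SAMPLE_CSV)
```

With a dataframe the same idea would be a single column concatenation; the point is only that each movie ends up with one text that carries both summaries.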

3. Tokenization and stemming

Tokenization

Tokenization is the process of breaking text down into individual sentences or words.
We will also remove tokens that are numeric values or punctuation.
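
The two steps described above (split into sentences and words, then drop numeric and punctuation tokens) can be sketched with regular expressions. This is a simplified illustration, not necessarily the tokenizer used in the notebook; the function names and the sample sentence are made up, and the filter only recognises ASCII letters.

```python
import re

def split_sentences(text):
    """Naive sentence split on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(text):
    """Lowercase word tokens; tokens without any ASCII letter are dropped."""
    tokens = []
    for sentence in split_sentences(text):
        # Words (optionally with an apostrophe part) or single punctuation marks.
        for token in re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence.lower()):
            # Keep only tokens containing a letter: removes numbers and punctuation.
            if re.search(r"[a-z]", token):
                tokens.append(token)
    return tokens

print(tokenize("In 1972, Vito Corleone's family fights. Who wins?"))
```

Here `1972`, the comma, the full stop and the question mark are discarded, while the words are kept in lowercase.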

Stemming

Stemming is the process of reducing the different forms of a word to its root.
For example, the words 'fishing', 'fished', and 'fisher' all get stemmed to the word 'fish'.
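
To make the idea concrete, here is a toy suffix stripper that reproduces the 'fish' example. It is only an illustration: `toy_stem` and its suffix list are hypothetical, and real stemmers (for example NLTK's `SnowballStemmer`) apply many more rules and may treat individual words differently.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter or Snowball algorithm.
SUFFIXES = ("ing", "ed", "er", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["fishing", "fished", "fisher", "fish"]])
```

All four inputs map to `fish`, so after stemming they count as the same term.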

4. TfidfVectorizer

Create TfidfVectorizer

TF-IDF identifies words that are distinctive and important to a document.

TF (Term Frequency): how often a term appears in a document.
IDF (Inverse Document Frequency): reduces the importance of a term if it appears in many documents.
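
The two quantities can be computed by hand on a tiny corpus to see how they interact. This sketch uses the textbook form `tf * log(N / df)` on made-up token lists; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalises each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical, already tokenized plots.
docs = [
    ["mafia", "family", "war"],
    ["family", "road", "trip"],
    ["space", "war", "family"],
]

def term_frequency(tokens):
    """Share of a document's tokens taken up by each term."""
    counts = Counter(tokens)
    return {term: n / len(tokens) for term, n in counts.items()}

def inverse_document_frequency(documents):
    """log(N / df): terms found in fewer documents score higher."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

idf = inverse_document_frequency(docs)
tf = term_frequency(docs[0])
tfidf = {term: tf[term] * idf[term] for term in tf}
```

In this corpus 'family' occurs in every document, so its weight is zero, while 'mafia' occurs in only one and receives the highest weight in the first document.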

Fit transform TfidfVectorizer

Once we create a TF-IDF vectorizer, we must fit it to the text and then transform the text to produce the corresponding numeric form of the data.
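
The split between fitting and transforming can be illustrated with a small stand-in class. `TinyTfidf` is hypothetical and is not scikit-learn's `TfidfVectorizer`; it only mirrors the two phases: `fit` learns the vocabulary and IDF weights, and `transform` turns each document into a row of numbers.

```python
import math
from collections import Counter

class TinyTfidf:
    """Hypothetical stand-in that mimics the fit/transform split."""

    def fit(self, documents):
        # fit: learn the vocabulary and one IDF weight per term from the corpus.
        doc_freq = Counter(term for doc in documents for term in set(doc))
        self.terms_ = sorted(doc_freq)
        n_docs = len(documents)
        self.idf_ = {t: math.log(n_docs / doc_freq[t]) for t in self.terms_}
        return self

    def transform(self, documents):
        # transform: one row per document, one column per learned term.
        # Assumes every document is a non-empty list of tokens.
        rows = []
        for doc in documents:
            counts = Counter(doc)
            rows.append([counts[t] / len(doc) * self.idf_[t] for t in self.terms_])
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)

docs = [["mafia", "family"], ["family", "trip"]]
model = TinyTfidf()
matrix = model.fit_transform(docs)
```

The result is a matrix with one row per movie and one column per term, which is the numeric form the later steps work on.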

5. Create clusters

To determine how closely one movie is related to another, we can use clustering techniques.
Clustering is the method of grouping together items that present similar properties.
We will cluster our dataset by the genre of the movies.
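
This README does not name the clustering algorithm, so the sketch below assumes k-means as one common choice. It is a plain-Python illustration on made-up two-dimensional vectors; the number of clusters, the seeding from the first `k` vectors, and all function names are assumptions.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, iterations=10):
    """Plain k-means; the first k vectors seed the centroids so runs are deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels

# Hypothetical TF-IDF-like vectors for four movies.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels = kmeans(vectors, k=2)
```

Movies with the same label fall into the same cluster; in this toy data the first and third vectors end up together, as do the second and fourth.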

6. Calculate similarity distance

Similarity distance = 1 - cosine similarity

If two movies' plots are similar, the cosine of the angle between their vectors is close to 1, so the distance between them is close to 1 - 1 = 0.
The more similar the movies are, the closer the distance is to 0.
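
The distance can be worked through on two small vectors. This is a plain-Python sketch with made-up values; the function names are illustrative, and both vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

# Same direction: cosine is 1 (up to rounding), so the distance is about 0.
same_direction = similarity_distance([1.0, 2.0], [2.0, 4.0])
# No terms in common: cosine is 0, so the distance is 1.
orthogonal = similarity_distance([1.0, 0.0], [0.0, 1.0])
```

Because TF-IDF weights are never negative, the cosine stays between 0 and 1, and so does the distance.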

7. Dendrograms

We will plot a dendrogram of the movies, using the similarity distance as the measure of how far apart two movies are.
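
A dendrogram draws the sequence of merges produced by hierarchical clustering. The sketch below computes that sequence in plain Python on a made-up distance matrix, assuming complete linkage; the linkage method, the movie names and the helper function are all assumptions. In practice a library such as SciPy (`scipy.cluster.hierarchy.linkage` and `dendrogram`) would typically compute and plot this.

```python
def complete_linkage_merges(names, dist):
    """Return the merge steps a dendrogram would draw, as (cluster_a, cluster_b, height)."""
    clusters = [(name,) for name in names]
    index = {name: i for i, name in enumerate(names)}

    def cluster_distance(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist[index[x]][index[y]] for x in a for y in b)

    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        pairs = [(a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]]
        a, b = min(pairs, key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical similarity distances between three movies.
names = ["A", "B", "C"]
dist = [
    [0.0, 0.1, 0.8],
    [0.1, 0.0, 0.7],
    [0.8, 0.7, 0.0],
]
merges = complete_linkage_merges(names, dist)
```

Movies A and B merge first at a low height because their distance is small, and C joins last; on a dendrogram the height of each join shows how dissimilar the merged groups are.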

8. Conclusion

We have quantified the similarity of movies based on their plot summaries available on IMDb and Wikipedia,
and we separated them into clusters.
We also created a dendrogram to represent how closely the movies are related to each other.

najlahamza/moviesimilarity
