Removing stopwords from a given corpus of data and performing analysis on the resulting tokens.


Stopwords Removal From a Corpus

Anish Sachdeva (DTU/2K16/MC/013)

Natural Language Processing - Dr. Seba Susan

📘 Running Notebook | 📄 Input | 📄 Code to print Output | ✒ Project Report


Overview

  1. Introduction
  2. Implementation
  3. Results
  4. Analytics & Discussion
  5. Running The Project Locally
  6. Bibliography

Introduction

The following steps are commonly implemented as part of the data preprocessing pipeline in any NLP project:

  1. Converting the entire corpus to a common case (either lowercase or uppercase)
  2. Extracting words (tokens) from the corpus
  3. Removing all punctuation from the extracted tokens and retaining only alphanumeric tokens
  4. Removing all stopwords from the extracted tokens
  5. Stemming each token to its root with a stemming algorithm

We have seen stemming in detail in the Porter Stemmer assignment; in this notebook we see how to extract tokens, remove punctuation and remove stopwords.
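The steps above can be sketched in plain Python. The project itself uses nltk for tokenization, stopword removal and stemming; the tiny stopword list and the naive suffix-stripping stemmer below are illustrative stand-ins, not the real nltk components.

```python
import re

# A tiny illustrative stopword list -- the project itself uses the much
# larger list from nltk.corpus.stopwords.words('english').
STOPWORDS = {"i", "me", "my", "the", "a", "an", "am", "and", "in", "of", "to", "is"}

def preprocess(corpus):
    # 1. Convert the entire corpus to a common case (lowercase here).
    corpus = corpus.lower()
    # 2 & 3. Extract tokens, keeping only alphanumeric sequences
    # (this drops punctuation in the same pass).
    tokens = re.findall(r"[a-z0-9]+", corpus)
    # 4. Remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 5. Stem each token -- a naive suffix strip standing in for the
    # Porter Stemmer (the project uses nltk.stem.PorterStemmer).
    def stem(token):
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token
    return [stem(t) for t in tokens]

print(preprocess("I am skilled in Java, Python and Web Development."))
```

Running this prints the keyword-dense token stream that survives the pipeline, e.g. `['skill', 'java', 'python', 'web', 'development']`-style output depending on the stopword list used.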

Implementation

The implementation uses the popular nltk (Natural Language Toolkit) library for tokenization, stopword removal and stemming. The entire process has been shown in detail in this Jupyter Notebook.

Results

The input for this program was my resume in text format, which can be viewed here: resume.txt. The output is a stream of lowercase tokens (words) with punctuation and stopwords removed. The output has been saved as a pickle file that can be loaded into any Python file (or Jupyter Notebook), and can be viewed in the Notebook here or in the Pickle File.
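Saving and reloading the token stream with pickle looks roughly like this; the file name `tokens.pkl` and the token list are illustrative, not the project's actual output.

```python
import os
import pickle
import tempfile

# Hypothetical token stream standing in for the processed resume output.
tokens = ["anish", "sachdeva", "java", "python", "web", "development"]

# Serialize the token list to a pickle file...
path = os.path.join(tempfile.gettempdir(), "tokens.pkl")
with open(path, "wb") as f:
    pickle.dump(tokens, f)

# ...and load it back in any other Python file or notebook.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == tokens)
```

Because pickle preserves the Python object exactly, the loaded list is identical to the one that was saved.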

Analytics & Discussion

After removing the stopwords from the resume, the number of words drops considerably, and the words we removed added little meaning to the text. Large companies that receive many resumes want to search them using keywords such as Java, Python or Web Development, while words such as i, me and mine are superfluous.

So, by removing these stopwords we have made our corpus more information-dense, and any further task we might perform, such as converting these words into embeddings or any other Machine Learning/Deep Learning task, will now run on a smaller corpus and hence finish faster.

Considering the above advantages, removing stopwords is a very beneficial preprocessing step.
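The kind of analysis described above can be sketched with the standard library's `collections.Counter`; the token lists here are hypothetical, not the actual resume data.

```python
from collections import Counter

# Hypothetical token lists before and after stopword removal.
raw = ["i", "am", "a", "developer", "skilled", "in", "java", "and",
       "python", "and", "i", "love", "python"]
stopwords = {"i", "am", "a", "in", "and"}
filtered = [t for t in raw if t not in stopwords]

# How much smaller did the corpus get?
reduction = 1 - len(filtered) / len(raw)
print(f"{len(raw)} -> {len(filtered)} tokens ({reduction:.0%} removed)")

# The surviving tokens are exactly the keyword-dense ones a recruiter
# would search for.
print(Counter(filtered).most_common(3))
```

On this toy input roughly half the tokens are stopwords, which matches the "considerable" reduction observed on the resume.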

Running the Project Locally

To run this project locally and see the output of the resume file once the stop words have been removed, first clone the project and enter the project directory.

git clone https://github.com/anishLearnsToCode/stop-words-removal.git
cd stop-words-removal

Install the required package (pickle ships with the Python standard library, so only nltk needs to be installed):

pip install nltk

Run the driver.py file to see the stream of tokens as output, and then run the analytics.py file to see various analytics on the prepared token list.

python driver.py
python analytics.py

Run the output.py file to see a pretty (well formatted) output of the resume.

python output.py

Bibliography

  1. Speech & Language Processing ~Jurafsky
  2. nltk
  3. pickle
  4. Porter Stemmer Algorithm
  5. Porter Stemmer Implementation ~anishLearnsToCode
  6. English Stopwords ~Wikipedia
