Removing stopwords from a given corpus of data and performing analysis on the resulting tokens.


Stopwords Removal From a Corpus

Anish Sachdeva (DTU/2K16/MC/013)

Natural Language Processing - Dr. Seba Susan

📘 Running Notebook | 📄 Input | 📄 Code to print Output | ✒ Project Report


Overview

  1. Introduction
  2. Implementation
  3. Results
  4. Analytics & Discussion
  5. Running The Project Locally
  6. Bibliography

Introduction

The following steps are commonly implemented as part of the data preprocessing pipeline in any NLP project:

  1. Converting the entire corpus to a common case (either lowercase or uppercase)
  2. Extracting words (tokens) from the corpus
  3. Removing all punctuation from the extracted tokens and retaining only alphanumeric tokens
  4. Removing all stopwords from the extracted tokens
  5. Stemming each token to its root with a stemming algorithm

We have seen stemming in detail in the Porter Stemmer assignment; in this notebook we see how to extract tokens, remove punctuation and remove stopwords.
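The steps above can be sketched in plain Python. The project itself uses nltk for tokenization, stopword removal and stemming; the tiny stopword list and the naive suffix-stripping stemmer below are illustrative stand-ins, not the real nltk components.

```python
import re

# A tiny illustrative stopword list -- the project itself uses the much
# larger list from nltk.corpus.stopwords.words('english').
STOPWORDS = {"i", "me", "my", "the", "a", "an", "am", "and", "in", "of", "to", "is"}

def preprocess(corpus):
    # 1. Convert the entire corpus to a common case (lowercase here).
    corpus = corpus.lower()
    # 2 & 3. Extract tokens, keeping only alphanumeric sequences
    # (this drops punctuation in the same pass).
    tokens = re.findall(r"[a-z0-9]+", corpus)
    # 4. Remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 5. Stem each token -- a naive suffix strip standing in for the
    # Porter Stemmer (the project uses nltk.stem.PorterStemmer).
    def stem(token):
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token
    return [stem(t) for t in tokens]

print(preprocess("I am skilled in Java, Python and Web Development."))
```

Running this prints the keyword-dense token stream that survives the pipeline, e.g. `['skill', 'java', 'python', 'web', 'development']`-style output depending on the stopword list used.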

Implementation

The implementation uses the popular nltk (Natural Language Toolkit) library for tokenization, stopword removal and stemming. The entire process has been shown in detail in this Jupyter Notebook.

Results

The input for this program was my resume in text format, which can be viewed here: resume.txt. The output is a stream of lowercase tokens (words) with punctuation and stopwords removed. The output has been saved as a pickle file that can be loaded into any Python file (or Jupyter Notebook), and can be viewed in the Notebook here or in the Pickle File.
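Saving and reloading the token stream with pickle looks roughly like this; the file name `tokens.pkl` and the token list are illustrative, not the project's actual output.

```python
import os
import pickle
import tempfile

# Hypothetical token stream standing in for the processed resume output.
tokens = ["anish", "sachdeva", "java", "python", "web", "development"]

# Serialize the token list to a pickle file...
path = os.path.join(tempfile.gettempdir(), "tokens.pkl")
with open(path, "wb") as f:
    pickle.dump(tokens, f)

# ...and load it back in any other Python file or notebook.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == tokens)
```

Because pickle preserves the Python object exactly, the loaded list is identical to the one that was saved.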

Analytics & Discussion

After removing the stopwords from the resume, the number of words drops considerably, and the words we removed added little meaning to the text. Large companies that receive many resumes want to search them using keywords such as Java, Python or Web Development, while words such as i, me and mine are superfluous.

So, by removing these stopwords we have made our corpus more information-dense, and any further task we might perform, such as converting these words into embeddings or any other Machine Learning/Deep Learning task, will now run on a smaller corpus and hence finish faster.

Considering the above advantages, removing stopwords is a very beneficial preprocessing step.
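The kind of analysis described above can be sketched with the standard library's `collections.Counter`; the token lists here are hypothetical, not the actual resume data.

```python
from collections import Counter

# Hypothetical token lists before and after stopword removal.
raw = ["i", "am", "a", "developer", "skilled", "in", "java", "and",
       "python", "and", "i", "love", "python"]
stopwords = {"i", "am", "a", "in", "and"}
filtered = [t for t in raw if t not in stopwords]

# How much smaller did the corpus get?
reduction = 1 - len(filtered) / len(raw)
print(f"{len(raw)} -> {len(filtered)} tokens ({reduction:.0%} removed)")

# The surviving tokens are exactly the keyword-dense ones a recruiter
# would search for.
print(Counter(filtered).most_common(3))
```

On this toy input roughly half the tokens are stopwords, which matches the "considerable" reduction observed on the resume.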

Running the Project Locally

To run this project locally and see the output of the resume file once the stop words have been removed, first clone the project and enter the project directory.

git clone https://github.com/anishLearnsToCode/stop-words-removal.git
cd stop-words-removal

Install the required package (pickle ships with the Python standard library, so only nltk needs to be installed):

pip install nltk

Run the driver.py file to see the stream of tokens as output, and then run the analytics.py file to see various analytics on the prepared token list.

python driver.py
python analytics.py

Run the output.py file to see a pretty (well formatted) output of the resume.

python output.py

Bibliography

  1. Speech & Language Processing ~Jurafsky
  2. nltk
  3. pickle
  4. Porter Stemmer Algorithm
  5. Porter Stemmer Implementation ~anishLearnsToCode
  6. English Stopwords ~Wikipedia
