Anish Sachdeva (DTU/2K16/MC/013)
Natural Language Processing - Dr. Seba Susan
📘 Running Notebook | 📄 Input | 📄 Code to print Output | ✒ Project Report
- Introduction
- Implementation
- Results
- Analytics & Discussions
- Running The Project Locally
- Bibliography
The following steps are almost universally implemented as part of the data preprocessing pipeline in any NLP project:
- Converting the entire corpus into a common case (either lowercase or uppercase)
- Extracting words/tokens from the corpus
- Removing all punctuation from the extracted tokens and retaining only alphanumeric tokens
- Removing all stopwords from the extracted tokens
- Stemming each token to its root with a stemming algorithm
We covered stemming in detail in the Porter Stemmer assignment; in this notebook we see how to extract tokens, remove punctuation, and remove stopwords.
The implementation uses the popular nltk (Natural Language Toolkit) library for tokenization, stopword removal, and stemming. The entire process has been shown in detail in this Jupyter Notebook.
The input for this program was my resume in text format, which can be viewed here: resume.txt. The output is a stream of tokens (words) that has been converted to lowercase, with punctuation and stopwords removed. The output has been saved as a pickle file, which can be loaded into any Python file (or Jupyter Notebook); it can be viewed in the Notebook here or in the Pickle File.
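Saving and reloading the token stream works as follows (a minimal sketch; the filename `tokens.pkl` and the sample tokens are illustrative, not the repository's actual file contents):

```python
# Persist a token list with pickle so any other Python file or
# notebook can load it back without re-running the pipeline.
import pickle

tokens = ["java", "python", "web", "development"]  # sample token stream

# Save the tokens to disk
with open("tokens.pkl", "wb") as f:
    pickle.dump(tokens, f)

# Load them back elsewhere
with open("tokens.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded)
```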
We have seen that after removing the stopwords from the resume, the number of words goes down considerably, and the words we removed never added much meaning to the text. Large companies that receive many resumes will want to search them using keywords such as Java, Python, Web Development, etc., and words such as i, me, and mine are superfluous.
So, by removing these stopwords we have made our corpus more information-dense, and any further task we perform, such as converting the words into embeddings or any other Machine Learning/Deep Learning task, will now run on a smaller corpus and hence finish faster.
Given these advantages, removing stopwords is a very beneficial preprocessing step.
To run this project locally and see the output of the resume file once the stop words have been removed, first clone the project and enter the project directory.
```bash
git clone https://github.com/anishLearnsToCode/stop-words-removal.git
cd stop-words-removal
```
Install the required package (pickle ships with the Python standard library, so it does not need to be installed separately):
```bash
pip install nltk
```
Run the driver.py file to see the stream of tokens as output, and then run the analytics.py file to see various analytics on the prepared token list.
```bash
python driver.py
python analytics.py
```
Run the output.py file to see a pretty (well-formatted) output of the resume.
```bash
python output.py
```