Although many high-level APIs are available, cleaning text manually allows more customization.
This module contains methods for several natural language processing (NLP) tasks, such as:
- Loading the data and selecting relevant parts
- Removing punctuation
- Removing words shorter than a chosen length
- Replacing numbers with their word-based equivalents
- Removing stopwords
- Lemmatization
- Tokenization
- Sentiment Analysis
- Part-of-Speech Tagging (POS Tagging)
- Named Entity Recognition (NER)
- TF-IDF Scoring
- Cosine Similarity
- MinHashing
- Word Embedding
- Latent Semantic Analysis (LSA)
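
To illustrate what doing a few of these steps by hand can look like, here is a minimal sketch using only the Python standard library. It is not this module's actual API: the function names (`clean_and_tokenize`, `tf_idf`, `cosine_similarity`), the `min_length` parameter, and the tiny `STOPWORDS` set are hypothetical, and the smoothed IDF formula is just one common variant.

```python
import math
import string
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "on", "in", "to"}


def clean_and_tokenize(text, min_length=3):
    """Lowercase, strip punctuation, split on whitespace,
    then drop stopwords and words shorter than min_length."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [
        token
        for token in no_punct.split()
        if len(token) >= min_length and token not in STOPWORDS
    ]


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Uses relative term frequency and a smoothed IDF,
    ln((1 + N) / (1 + df)) + 1, which is one of several common variants.
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


texts = [
    "The cat sat on the mat.",
    "A cat is on a mat!",
    "Stock markets fell sharply today.",
]
tokens = [clean_and_tokenize(text) for text in texts]
vectors = tf_idf(tokens)

print(tokens[0])  # ['cat', 'sat', 'mat']
print(cosine_similarity(vectors[0], vectors[1]))  # shared terms, so above 0
print(cosine_similarity(vectors[0], vectors[2]))  # no shared terms, so 0.0
```

Writing the steps out like this makes each choice (which characters count as punctuation, the minimum word length, the stopword list, the IDF variant) explicit and easy to change, which is the kind of customization a high-level API may hide.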