This file works with a piece of text scraped from a website and applies basic text pre-processing techniques: lemmatization, word and sentence tokenization, stopword removal, punctuation removal, upper-to-lower-case conversion, and digit removal. The pre-processed text is then used to build a frequency distribution of words, and the text is ranked and summarized using both TF-IDF and Gensim, with the two results compared.
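A minimal sketch of such a pipeline, assuming NLTK for the pre-processing and scikit-learn's `TfidfVectorizer` for the TF-IDF sentence ranking (the sample text, variable names, and top-2 sentence cutoff are illustrative, not taken from this repo):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

# Placeholder standing in for the scraped page text.
text = "NLTK is a leading NLP platform. It was first released in 2001. Over 100 corpora ship with it."

# Sentence and word tokenization, with upper-to-lower-case conversion.
sentences = sent_tokenize(text)
words = word_tokenize(text.lower())

# Stopword, punctuation, and digit removal, followed by lemmatization.
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
cleaned = [
    lemmatizer.lemmatize(w)
    for w in words
    if w not in stop_words and w not in string.punctuation and not w.isdigit()
]

# Frequency distribution of the cleaned words.
freq = nltk.FreqDist(cleaned)
print(freq.most_common(5))

# TF-IDF sentence ranking: score each sentence by the sum of its term
# weights and keep the top-scoring sentences, in original order, as the summary.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
scores = tfidf.sum(axis=1).A1
top = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:2])
print(" ".join(sentences[i] for i in top))
```

For the Gensim side of the comparison, the extractive summarizer lived in `gensim.summarization.summarize`; note that this module was removed in Gensim 4.0, so reproducing that comparison requires `gensim<4.0`.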
A similar approach is applied to this text data, but N-grams are used for the word-frequency calculation and summarization. Unigrams, bigrams, and trigrams are created first and then used for the frequency counts, as in the sketch below.
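A short sketch of the N-gram step, assuming `nltk.ngrams`; the token list is a placeholder for the pre-processed tokens from the pipeline above:

```python
from nltk import FreqDist, ngrams

# Placeholder tokens; in the notebook these come from the pre-processing step.
tokens = ["the", "cat", "sat", "on", "the", "mat", "near", "the", "cat"]

# Unigrams, bigrams, and trigrams as tuples of adjacent tokens.
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

# Frequency counts per n-gram, e.g. the most common unigrams and bigrams.
print(FreqDist(unigrams).most_common(3))
print(FreqDist(bigrams).most_common(3))
```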
Created word tokens from the sentences, computed the frequency of each unigram and the relative frequency of each bigram, then performed next-word prediction using the relative frequencies as probabilities.
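A sketch of the prediction step under the standard bigram model: the relative frequency count(w1, w2) / count(w1) estimates P(w2 | w1), and the predicted next word is the candidate with the highest probability. The training sentence and the `predict_next` helper are illustrative assumptions:

```python
from collections import Counter

from nltk import bigrams

# Placeholder training tokens (word tokenization via split() for brevity;
# the notebook would use a proper word tokenizer).
tokens = "the cat sat on the mat and the cat slept".lower().split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams(tokens))

def predict_next(word, k=3):
    # Relative frequency of each bigram starting with `word`,
    # i.e. P(next | word) = count(word, next) / count(word).
    probs = {
        w2: count / unigram_counts[w1]
        for (w1, w2), count in bigram_counts.items()
        if w1 == word
    }
    return sorted(probs.items(), key=lambda p: p[1], reverse=True)[:k]

print(predict_next("the"))  # e.g. [('cat', 0.67), ('mat', 0.33)]
```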
Used the spaCy library to perform Named Entity Recognition on a web-scraped news article.
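A minimal spaCy NER sketch; the article text is a placeholder, and the small English model is assumed to be installed (`python -m spacy download en_core_web_sm`):

```python
import spacy

# Assumes the small English pipeline has been downloaded.
nlp = spacy.load("en_core_web_sm")

# Placeholder standing in for the scraped news article.
article = "Reuters reported that Apple will open a new office in London in 2025."

doc = nlp(article)
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, London GPE, 2025 DATE
```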