Sentiment analysis of Medium app reviews from Google Play Store

This project analyzes sentiment patterns in Medium app reviews from the Google Play Store using natural language processing (NLP) techniques and machine learning algorithms.

To access the live version of the app, click here.


The dataset and the complete data dictionary can be found on Kaggle.


*I am specifically using scikit-learn version 1.2.2 in this project due to a bug which is discussed here.


The data is first fetched from Kaggle, and irrelevant columns such as reviewId and repliedAt are then removed. You can follow the Jupyter Notebook for a detailed walkthrough.

Defining the pipeline

Data acquisition -> Preprocessing -> Train-Test split -> Model building -> Hyperparameter tuning the model -> Saving the model -> Predicting new data using the saved model.


Basic exploratory data analysis steps, like checking null values and the data types of the columns, reveal a large class imbalance in the sentiment column: POSITIVE has 39982 values, NEGATIVE has 5863, and NEUTRAL has 7198.
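To put the imbalance in numbers, the share of each class follows directly from the counts above. This is a minimal sketch; the variable names are illustrative and not taken from the notebook.

```python
# Class counts as reported for the sentiment column
counts = {"POSITIVE": 39982, "NEGATIVE": 5863, "NEUTRAL": 7198}

total = sum(counts.values())  # number of reviews overall

# Fraction of the dataset that falls in each class
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:<8} {share:.1%}")
```

Roughly three out of four reviews are POSITIVE, which is why the choice of evaluation metric matters later on.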

Why not simply balance the classes using balancing techniques like RandomUnderSampler or RandomOverSampler?

  • Undersampling discards data, which can lead to information loss, while oversampling adds randomly duplicated reviews to the training set, which can cause overfitting to those duplicates.
  • The model might not generalize well to unseen data with the original imbalance.

Next, we clean the text before passing it to the model by removing numbers, special characters, and stop words.
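The cleaning step could look roughly like the sketch below. The function name `clean_text`, the lowercasing step, and the tiny stop-word set are assumptions for illustration; the notebook may well use a fuller list (for example from NLTK) and different rules.

```python
import re

# Tiny illustrative stop-word set (assumption, not the notebook's list)
STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of", "it", "this", "i"}

def clean_text(text: str) -> str:
    """Lowercase, drop digits and special characters, remove stop words."""
    text = text.lower()
    # Replace anything that is not a letter or whitespace with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

cleaned = clean_text("This app is GREAT!!! 10/10, I love it :)")
print(cleaned)  # app great love
```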

Next, we use scikit-learn's LabelEncoder to encode the sentiment column values as integers.
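As a small illustration of what the encoder does, the sketch below fits it on a few sample labels. The sample list is made up; only the three class names come from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["POSITIVE", "NEGATIVE", "NEUTRAL", "POSITIVE"]

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

# Classes are sorted alphabetically, so NEGATIVE=0, NEUTRAL=1, POSITIVE=2
print(list(label_encoder.classes_))
print(encoded.tolist())

# inverse_transform maps integer predictions back to readable labels
decoded = label_encoder.inverse_transform([2, 0]).tolist()
print(decoded)
```

Keeping the fitted encoder around matters because it is the only record of which integer stands for which sentiment.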

The last step before the train-test split is converting the text into numeric features using the TF-IDF Vectorizer.

Train-Test Split

The data is split 70-30, with 70% in the training set and 30% in the test set, using scikit-learn.
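The split sizes can be checked with a short sketch. It assumes `train_test_split` is the function used and picks an arbitrary `random_state`; it splits plain row indices rather than the real feature matrix.

```python
from sklearn.model_selection import train_test_split

n_reviews = 39982 + 5863 + 7198  # total from the class counts
indices = list(range(n_reviews))

# test_size=0.3 gives the 70-30 split; random_state is an arbitrary seed
train_idx, test_idx = train_test_split(indices, test_size=0.3, random_state=42)

print(len(train_idx), len(test_idx))
```

The resulting test set size of 15913 equals the sum of all entries in the confusion matrix reported in the tuning section.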

Model building

Initially, I built a basic Random Forest Classifier model and carried out the predictions.

Evaluation metrics used:

I have specifically chosen Macro F1 score for the following reasons:

  • Balances Performance: It considers the performance of the model on both the majority (Positive) and minority classes (Negative and Neutral), providing a more comprehensive picture of its effectiveness.
  • Identifies Overfitting to Majority Class: A high macro F1 score suggests that the model is performing well on all classes, not just the majority class.
  • Focusing only on accuracy can be misleading, as a classifier could simply predict "Positive" for every instance and still achieve a high accuracy (around 75%, the share of positive reviews). However, this wouldn't reflect the model's ability to accurately classify the minority classes (Negative and Neutral), which are also crucial.
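The gap between accuracy and macro F1 for such an "always Positive" classifier can be worked out from the class counts alone. This is a back-of-the-envelope sketch on the full dataset, not a result from the notebook.

```python
counts = {"POSITIVE": 39982, "NEGATIVE": 5863, "NEUTRAL": 7198}
total = sum(counts.values())

# A classifier that always answers POSITIVE is right on every positive review
accuracy = counts["POSITIVE"] / total

# Per-class F1 for that classifier: only POSITIVE gets any true positives
precision_pos = counts["POSITIVE"] / total  # every review is predicted POSITIVE
recall_pos = 1.0                            # every positive review is found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
f1_neg = 0.0  # class never predicted, so its F1 is taken as 0
f1_neu = 0.0

macro_f1 = (f1_pos + f1_neg + f1_neu) / 3

print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
```

Accuracy looks respectable while macro F1 collapses, because two of the three per-class scores are zero.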

Hyperparameter Tuning the model

The existing random forest classifier is hyper-parameter tuned using GridSearchCV with the following grid:

    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
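To get a feel for the cost of this search, the grid can be expanded by hand. The name `param_grid` and the fold count are assumptions: 5 is `GridSearchCV`'s default `cv`, and the value actually used is not stated here.

```python
from itertools import product

# Grid copied from the text; the name param_grid is an assumption
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Every combination that a grid search would evaluate
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_folds = 5  # GridSearchCV's default cv (assumed)
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```

With these settings the search trains 54 candidate forests per fold, so it is worth running once and saving the result.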


Next, the model was retrained with the best parameters and the new evaluation metrics are as follows:

F1 score = 0.75 
Accuracy = 0.87
Confusion matrix = 
[[ 1092   148   507]
 [  160  1235   795]
 [  127   312 11537]]
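Both reported metrics can be recomputed from the confusion matrix itself. The sketch below assumes scikit-learn's convention (rows are true classes, columns are predictions) and the alphabetical class order NEGATIVE, NEUTRAL, POSITIVE that LabelEncoder would produce.

```python
# Confusion matrix from the text: rows = true class, columns = predicted class
cm = [
    [1092, 148, 507],
    [160, 1235, 795],
    [127, 312, 11537],
]

n = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n)) / total  # diagonal = correct predictions

f1_scores = []
for i in range(n):
    tp = cm[i][i]
    predicted = sum(cm[r][i] for r in range(n))  # column sum
    actual = sum(cm[i])                          # row sum
    precision = tp / predicted
    recall = tp / actual
    f1_scores.append(2 * precision * recall / (precision + recall))

macro_f1 = sum(f1_scores) / n  # unweighted mean over the three classes
print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
```

The per-class scores in `f1_scores` show where the model struggles: the two minority classes score well below the majority class.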

Saving the model

Finally, we train best_rf_model on the training data again and save it, along with label_encoder and vectorizer, in optimized_model_steps.pkl.gz. Since it is a large file, it is written with pickle and compressed with gzip as follows:

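import gzip
import pickle

# Bundle the tuned model with the fitted encoder and vectorizer
data = {
    "best_model": best_rf_model,
    "le": label_encoder,
    "vectorizer": vectorizer,
}

# gzip compresses the pickled bytes to keep the file small
with gzip.open('optimized_model_steps.pkl.gz', 'wb') as file:
    pickle.dump(data, file)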
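Reading the file back is the same two layers in reverse. The sketch below is a self-contained round trip that uses placeholder values instead of the real model objects and writes to a temporary directory; only the dictionary keys and the file name come from the text.

```python
import gzip
import os
import pickle
import tempfile

# Placeholders stand in for best_rf_model, label_encoder and vectorizer
data = {
    "best_model": "model-placeholder",
    "le": ["NEGATIVE", "NEUTRAL", "POSITIVE"],
    "vectorizer": "vectorizer-placeholder",
}

path = os.path.join(tempfile.mkdtemp(), "optimized_model_steps.pkl.gz")

# Write: pickle serializes the dict, gzip compresses the byte stream
with gzip.open(path, "wb") as file:
    pickle.dump(data, file)

# Read: decompress, then unpickle
with gzip.open(path, "rb") as file:
    restored = pickle.load(file)

print(sorted(restored))  # ['best_model', 'le', 'vectorizer']
```

As with any pickle file, it should only be loaded from a source you trust, ideally with the same scikit-learn version that created it.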

Run Locally

Clone the project

  git clone

Go to the project directory

  cd medium-reviews-sentiment-analysis

Install dependencies

  pip install -r requirements.txt

Start the server

  streamlit run


Clone the project

  git clone

Go to the project directory

  cd medium-reviews-sentiment-analysis

Install dependencies

  pip install -r requirements.txt

and finally,

  python

The script contains the whole code. The data is fetched and preprocessed, the model is built (with the best parameters shown above), and the sentiment of the text you enter is predicted.


If you don't want to rebuild the model every time you enter new input, first build the best model with notebook/sentiment-analysis.ipynb and save it as optimized_model_steps.pkl.gz, then follow this process in your command line:

Clone the project

  git clone

Go to the project directory

  cd medium-reviews-sentiment-analysis

Install dependencies

  pip install -r requirements.txt

and finally,

  python src/

This loads the pre-existing best random forest model from notebooks/optimized_model_steps.pkl.gz and carries out the predictions.
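How the three saved objects work together at prediction time can be sketched as follows. Everything here is illustrative: the helper `predict_sentiment`, the toy training data, and the small forest are assumptions, and a real run would load the bundle from the saved file rather than build it inline.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy training data (hypothetical); stands in for the loaded bundle
texts = ["love this app", "great articles", "app keeps crashing",
         "terrible paywall", "it is okay", "average experience"]
sentiments = ["POSITIVE", "POSITIVE", "NEGATIVE",
              "NEGATIVE", "NEUTRAL", "NEUTRAL"]

vectorizer = TfidfVectorizer()
label_encoder = LabelEncoder()
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(vectorizer.fit_transform(texts), label_encoder.fit_transform(sentiments))

# Same keys as the saved dictionary
bundle = {"best_model": model, "le": label_encoder, "vectorizer": vectorizer}

def predict_sentiment(bundle: dict, text: str) -> str:
    """Vectorize one review, predict its class id, and decode the label."""
    features = bundle["vectorizer"].transform([text])  # reuse fitted vocabulary
    class_id = bundle["best_model"].predict(features)
    return str(bundle["le"].inverse_transform(class_id)[0])

prediction = predict_sentiment(bundle, "love the articles")
print(prediction)
```

The important detail is that new text goes through `transform`, not `fit_transform`, so it is mapped onto the vocabulary learned during training.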


This code has been deployed using Streamlit Community Cloud and the file is

To run the project locally, follow the steps mentioned above.

