This research paper is the result of a collaboration between researchers from the Faculty of Computer Science & Engineering at the Ss. Cyril and Methodius University in Skopje and researchers from Boston University.
Company classification is a fundamental task in natural language processing (NLP) that aims to categorize businesses based on their textual descriptions. The paper employs and compares several NLP approaches to this problem, each with its own strengths and limitations.
The code is available as Colab notebooks. The easiest way to use them is to upload them to Google Colab and run them there. The notebooks contain imports of all the necessary Python libraries; there are no external dependencies, and no additional modules or libraries need to be installed.
We suggest using a GPU runtime when running the notebooks for faster execution.
There are 19 Colab notebooks, arranged in 5 folders, used for classifying companies based on their descriptions. Each notebook processes the datasets in a specific way.
In the notebooks that use the OpenAI API, replace the "YOUR-API-KEY-HERE" placeholder with your own OpenAI API key. Please follow the instructions in the notebooks.
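How the key is supplied varies by notebook; as a rough illustration only (the client style, model name, and prompt below are assumptions, not necessarily what the notebooks use):

```python
# Illustrative only: supplying an OpenAI API key and issuing a simple request.
# The notebooks may use a different client version, variable name, or prompt format.
from openai import OpenAI

client = OpenAI(api_key="YOUR-API-KEY-HERE")  # paste your key from the OpenAI dashboard

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Which GICS sector best fits: 'A retail bank'?"}],
)
print(response.choices[0].message.content)
```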
A brief description of each folder's contents:
- **AutoEncoder Approach**
  - An autoencoder was used to reduce the dimensionality of the sentence embeddings (see the sketch below).
  - A OneVsRest classifier was then used for classification on the reduced representations.
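A minimal sketch of this pipeline; the layer sizes, latent dimension, estimator, and placeholder data are illustrative assumptions, not the notebook's exact setup:

```python
# Illustrative sketch: autoencoder for dimensionality reduction + OneVsRest classifier.
# Layer sizes, estimator, and training settings are assumptions, not the paper's exact setup.
import numpy as np
from tensorflow import keras
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

def build_autoencoder(input_dim: int, latent_dim: int = 64):
    inputs = keras.Input(shape=(input_dim,))
    encoded = keras.layers.Dense(256, activation="relu")(inputs)
    latent = keras.layers.Dense(latent_dim, activation="relu")(encoded)
    decoded = keras.layers.Dense(256, activation="relu")(latent)
    outputs = keras.layers.Dense(input_dim, activation="linear")(decoded)
    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, latent)          # exposes the low-dimensional code
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# X: sentence embeddings of the company descriptions, y: sector labels (placeholders here)
X = np.random.rand(100, 768).astype("float32")
y = np.random.randint(0, 11, size=100)

autoencoder, encoder = build_autoencoder(input_dim=X.shape[1])
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)  # reconstruct the embeddings

X_reduced = encoder.predict(X)                               # reduced-dimension embeddings
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_reduced, y)
```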
- **BERT Transformer**
  - This notebook uses the 'roberta-base' transformer and its tokenizer for classification (a minimal sketch follows).
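A minimal sketch of how such a model can be fine-tuned with Hugging Face Transformers; the training arguments, dataset wrapper, and number of labels (the 11 GICS sectors) are assumptions rather than the notebook's exact configuration:

```python
# Illustrative sketch: fine-tuning 'roberta-base' for sector classification.
# Training arguments, dataset wrapper, and num_labels are assumptions.
import torch
from transformers import (RobertaTokenizer, RobertaForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=11)

texts = ["A bank offering retail and commercial services.",
         "A company designing and producing microchips."]
labels = [0, 1]  # placeholder sector indices

encodings = tokenizer(texts, truncation=True, padding=True)

class DescriptionDataset(torch.utils.data.Dataset):
    """Wraps tokenized descriptions and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-out", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=DescriptionDataset(encodings, labels),
)
trainer.train()
```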
- **OneVsRest Classification**
  - This folder contains models that use the OneVsRest classifier with a Support Vector Classifier as its estimator (see the sketch below).
  - The 'all-mpnet-base-v2' and 'hkunlp_instructor-large' sentence transformers were used for embedding the sentences.
  - The OpenAI API was used to compare the model's results against ChatGPT predictions on the same data.
  - Various techniques were used for cleaning the data.
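A minimal sketch of the embedding-plus-classifier step; the placeholder data and SVC settings are assumptions, and the data cleaning and ChatGPT comparison are omitted:

```python
# Illustrative sketch: sentence-transformer embeddings + OneVsRest SVC.
# The placeholder descriptions, labels, and SVC settings are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

descriptions = [
    "A bank offering retail and commercial services.",
    "An insurance provider for households.",
    "A company designing and producing microchips.",
    "A developer of enterprise software platforms.",
]
sectors = ["Financials", "Financials", "Information Technology", "Information Technology"]

embedder = SentenceTransformer("all-mpnet-base-v2")  # the instructor-large model is an alternative
X = embedder.encode(descriptions)

clf = OneVsRestClassifier(SVC(kernel="linear", probability=True)).fit(X, sectors)
print(clf.predict(embedder.encode(["An operator of investment banking branches."])))
```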
- **Unsupervised Approach**
  - Agglomerative Clustering was used to cluster the company descriptions into the 11 GICS sectors (a sketch follows this list).
  - The 'nli-distilroberta-base-v2' sentence transformer was used for embedding the sentences.
  - PCA was used to reduce the dimensionality and ease the visualization process.
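A minimal sketch of this pipeline; the PCA dimensionality, linkage, and the reduced cluster count on the toy data are illustrative assumptions (the notebooks cluster into the 11 GICS sectors):

```python
# Illustrative sketch: embeddings -> PCA -> Agglomerative Clustering.
# PCA components, linkage, and the toy cluster count are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

descriptions = [
    "A bank offering retail and commercial services.",
    "A company designing and producing microchips.",
    "An operator of oil and gas pipelines.",
    # ... more company descriptions
]

embedder = SentenceTransformer("nli-distilroberta-base-v2")
embeddings = embedder.encode(descriptions)

# Reduce to 2 components for visualization; the notebooks may use a different setting.
reduced = PCA(n_components=2).fit_transform(embeddings)

# Real runs use n_clusters=11 (GICS sectors); 3 is used here only because the toy data are tiny.
clustering = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(reduced)
print(clustering.labels_)
```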
- **Zero-Shot Classification**
  - This folder contains models that use the Zero-Shot classification pipeline with the 'valhalla/distilbart-mnli-12-3' model (a sketch follows this list).
  - TF-IDF was used to obtain more precise representations of the sector names and enhance the accuracy of the model.
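A minimal sketch of the zero-shot pipeline; the candidate labels below are plain sector names, whereas the notebooks refine these label representations with TF-IDF:

```python
# Illustrative sketch: zero-shot classification with 'valhalla/distilbart-mnli-12-3'.
# The candidate labels are plain GICS sector names; the notebooks refine them via TF-IDF.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-3")

description = "A company designing and manufacturing semiconductor chips for data centers."
candidate_labels = ["Information Technology", "Financials", "Energy", "Health Care"]

result = classifier(description, candidate_labels)
print(result["labels"][0], result["scores"][0])  # top predicted sector and its score
```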
If you use this project in your research, we would appreciate a citation to the following paper:
@article{rizinski2024comparative,
  title={Comparative Analysis of NLP-Based Models for Company Classification},
  author={Rizinski, Maryan and Jankov, Andrej and Sankaradas, Vignesh and Pinsky, Eugene and Mishkovski, Igor and Trajanov, Dimitar},
  journal={Information},
  volume={15},
  number={2},
  pages={77},
  year={2024},
  publisher={MDPI}
}