This research paper is the result of a collaboration between researchers from the Faculty of Computer Science & Engineering at the Ss. Cyril and Methodius University in Skopje and researchers from Boston University.
Company classification is a fundamental task in natural language processing (NLP) that aims to categorize businesses based on their textual descriptions. The paper employs and compares several NLP approaches to this problem, each with its own strengths and limitations.
The code is available as Colab notebooks. The easiest way to use them is to upload them to Google Colab and run them there. The notebooks contain imports of all the necessary Python libraries; there are no external dependencies, and no additional modules or libraries need to be installed.
We suggest using a GPU runtime when running the notebooks for faster execution.
There are 19 Colab notebooks, arranged in 5 folders, used for classifying companies based on their descriptions. Each notebook processes the datasets in a specific way.
In the notebooks that use the OpenAI API, replace the "YOUR-API-KEY-HERE" placeholder with your own OpenAI API key. Please follow the instructions in the notebooks.
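How the key is supplied varies by notebook; as a rough illustration only (the client style, model name, and prompt below are assumptions, not necessarily what the notebooks use):

```python
# Illustrative only: supplying an OpenAI API key and issuing a simple request.
# The notebooks may use a different client version, variable name, or prompt format.
from openai import OpenAI

client = OpenAI(api_key="YOUR-API-KEY-HERE")  # paste your key from the OpenAI dashboard

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Which GICS sector best fits: 'A retail bank'?"}],
)
print(response.choices[0].message.content)
```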
A brief description of each folder's contents:
- **AutoEncoder Approach**
  - An autoencoder was used to reduce the dimensionality of the sentence embeddings (see the sketch below).
  - A OneVsRest classifier was then used for classification on the reduced representations.
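A minimal sketch of this pipeline; the layer sizes, latent dimension, estimator, and placeholder data are illustrative assumptions, not the notebook's exact setup:

```python
# Illustrative sketch: autoencoder for dimensionality reduction + OneVsRest classifier.
# Layer sizes, estimator, and training settings are assumptions, not the paper's exact setup.
import numpy as np
from tensorflow import keras
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

def build_autoencoder(input_dim: int, latent_dim: int = 64):
    inputs = keras.Input(shape=(input_dim,))
    encoded = keras.layers.Dense(256, activation="relu")(inputs)
    latent = keras.layers.Dense(latent_dim, activation="relu")(encoded)
    decoded = keras.layers.Dense(256, activation="relu")(latent)
    outputs = keras.layers.Dense(input_dim, activation="linear")(decoded)
    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, latent)          # exposes the low-dimensional code
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# X: sentence embeddings of the company descriptions, y: sector labels (placeholders here)
X = np.random.rand(100, 768).astype("float32")
y = np.random.randint(0, 11, size=100)

autoencoder, encoder = build_autoencoder(input_dim=X.shape[1])
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)  # reconstruct the embeddings

X_reduced = encoder.predict(X)                               # reduced-dimension embeddings
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_reduced, y)
```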
- **BERT Transformer**
  - This notebook uses the 'roberta-base' transformer and its tokenizer for classification (a minimal sketch follows).
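A minimal sketch of how such a model can be fine-tuned with Hugging Face Transformers; the training arguments, dataset wrapper, and number of labels (the 11 GICS sectors) are assumptions rather than the notebook's exact configuration:

```python
# Illustrative sketch: fine-tuning 'roberta-base' for sector classification.
# Training arguments, dataset wrapper, and num_labels are assumptions.
import torch
from transformers import (RobertaTokenizer, RobertaForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=11)

texts = ["A bank offering retail and commercial services.",
         "A company designing and producing microchips."]
labels = [0, 1]  # placeholder sector indices

encodings = tokenizer(texts, truncation=True, padding=True)

class DescriptionDataset(torch.utils.data.Dataset):
    """Wraps tokenized descriptions and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-out", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=DescriptionDataset(encodings, labels),
)
trainer.train()
```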
- **OneVsRest Classification**
  - This folder contains models that use the OneVsRest classifier with a Support Vector Classifier as its estimator (see the sketch below).
  - The 'all-mpnet-base-v2' and 'hkunlp_instructor-large' sentence transformers were used for embedding the sentences.
  - The OpenAI API was used to compare the model's results against ChatGPT predictions on the same data.
  - Various techniques were used for cleaning the data.
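A minimal sketch of the embedding-plus-classifier step; the placeholder data and SVC settings are assumptions, and the data cleaning and ChatGPT comparison are omitted:

```python
# Illustrative sketch: sentence-transformer embeddings + OneVsRest SVC.
# The placeholder descriptions, labels, and SVC settings are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

descriptions = [
    "A bank offering retail and commercial services.",
    "An insurance provider for households.",
    "A company designing and producing microchips.",
    "A developer of enterprise software platforms.",
]
sectors = ["Financials", "Financials", "Information Technology", "Information Technology"]

embedder = SentenceTransformer("all-mpnet-base-v2")  # the instructor-large model is an alternative
X = embedder.encode(descriptions)

clf = OneVsRestClassifier(SVC(kernel="linear", probability=True)).fit(X, sectors)
print(clf.predict(embedder.encode(["An operator of investment banking branches."])))
```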
- **Unsupervised Approach**
  - Agglomerative Clustering was used to cluster the company descriptions into the 11 GICS sectors (a sketch follows this list).
  - The 'nli-distilroberta-base-v2' sentence transformer was used for embedding the sentences.
  - PCA was used to reduce the dimensionality and ease the visualization process.
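A minimal sketch of this pipeline; the PCA dimensionality, linkage, and the reduced cluster count on the toy data are illustrative assumptions (the notebooks cluster into the 11 GICS sectors):

```python
# Illustrative sketch: embeddings -> PCA -> Agglomerative Clustering.
# PCA components, linkage, and the toy cluster count are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

descriptions = [
    "A bank offering retail and commercial services.",
    "A company designing and producing microchips.",
    "An operator of oil and gas pipelines.",
    # ... more company descriptions
]

embedder = SentenceTransformer("nli-distilroberta-base-v2")
embeddings = embedder.encode(descriptions)

# Reduce to 2 components for visualization; the notebooks may use a different setting.
reduced = PCA(n_components=2).fit_transform(embeddings)

# Real runs use n_clusters=11 (GICS sectors); 3 is used here only because the toy data are tiny.
clustering = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(reduced)
print(clustering.labels_)
```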
- **Zero-Shot Classification**
  - This folder contains models that use the Zero-Shot classification pipeline with the 'valhalla/distilbart-mnli-12-3' model (a sketch follows this list).
  - TF-IDF was used to obtain more precise representations of the sector names and enhance the accuracy of the model.
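A minimal sketch of the zero-shot pipeline; the candidate labels below are plain sector names, whereas the notebooks refine these label representations with TF-IDF:

```python
# Illustrative sketch: zero-shot classification with 'valhalla/distilbart-mnli-12-3'.
# The candidate labels are plain GICS sector names; the notebooks refine them via TF-IDF.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-3")

description = "A company designing and manufacturing semiconductor chips for data centers."
candidate_labels = ["Information Technology", "Financials", "Energy", "Health Care"]

result = classifier(description, candidate_labels)
print(result["labels"][0], result["scores"][0])  # top predicted sector and its score
```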
If you use this project in your research, we would appreciate a citation to the following paper:
@article{rizinski2024comparative,
  title={Comparative Analysis of NLP-Based Models for Company Classification},
  author={Rizinski, Maryan and Jankov, Andrej and Sankaradas, Vignesh and Pinsky, Eugene and Mishkovski, Igor and Trajanov, Dimitar},
  journal={Information},
  volume={15},
  number={2},
  pages={77},
  year={2024},
  publisher={MDPI}
}