Topic modeling across pretraining, done alongside Dongchen Dai and Andreea Elena Bodea.
Topic modeling is the NLP task of extracting topics from text documents, and has traditionally relied on probabilistic models such as Latent Dirichlet Allocation (LDA). Here, topic modeling is instead done with the more recent approach of applying clustering methods to Large Language Model embeddings, which typically yields better results and was popularized by BERTopic.
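For reference, the BERTopic library packages this embed-then-cluster pipeline into a few calls (a minimal usage sketch, shown only for comparison; this repo implements the individual steps itself, as described below):

```python
# Off-the-shelf embed-then-cluster topic modeling with BERTopic
# (pip install bertopic). This repo builds the same pipeline
# manually from the pieces sketched further down.
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:1000]
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())
```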
GPT-NeoX has publicly available weights at intermediate pretraining checkpoints. Embeddings taken at these different points in pretraining were used to see whether they lead to better topic modeling results.
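As a sketch of how such checkpoint embeddings can be extracted, assuming the checkpoints are published as revision tags on the Hugging Face Hub (as EleutherAI does for its Pythia suite of GPT-NeoX-style models; the model name and revision string below are illustrative, not this repo's exact configuration):

```python
# Mean-pooled hidden-state embeddings from one pretraining checkpoint.
# MODEL and REVISION are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "EleutherAI/pythia-70m"  # assumed: a GPT-NeoX-style model with checkpoints
REVISION = "step100000"          # assumed: one of the published checkpoint tags

tokenizer = AutoTokenizer.from_pretrained(MODEL, revision=REVISION)
tokenizer.pad_token = tokenizer.eos_token  # GPT-NeoX tokenizers have no pad token
model = AutoModel.from_pretrained(MODEL, revision=REVISION).eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=100, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)    # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```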
K-means, k-medoids, and spherical k-means were used as the clustering methods.
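A sketch of the three options, assuming `emb` is an (n_docs, dim) array of document embeddings and an illustrative topic count of 20; spherical k-means is approximated here by L2-normalizing the embeddings so that Euclidean k-means optimizes cosine similarity on the unit sphere:

```python
# Clustering embeddings three ways; `emb` is a placeholder standing
# in for the real document embeddings produced above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn_extra.cluster import KMedoids  # pip install scikit-learn-extra

emb = np.random.default_rng(0).random((500, 64))  # placeholder embeddings
k = 20                                            # assumed number of topics

kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
kmedoids_labels = KMedoids(n_clusters=k, random_state=0).fit_predict(emb)
# Spherical k-means via unit-normalized vectors + ordinary k-means.
spherical_labels = KMeans(n_clusters=k, n_init=10,
                          random_state=0).fit_predict(normalize(emb))
```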
Reuters-8 (R8), 20 Newsgroups, or a preprocessed version of 20 Newsgroups were used as the datasets for topic modeling.
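For 20 Newsgroups, scikit-learn ships a loader; the sketch below treats "preprocessed" as stripping headers, footers, and quotes, which is one common variant (R8, a Reuters-21578 subset, has no built-in loader and is typically read from a local file):

```python
# Raw and lightly preprocessed 20 Newsgroups corpora via scikit-learn.
from sklearn.datasets import fetch_20newsgroups

raw_docs = fetch_20newsgroups(subset="all").data
clean_docs = fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes")).data
```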
Due to memory constraints, it is advised to run this code on a server: even 1,000 rows of data with a max length of 100 characters will consume the entire allocated memory in Google Colab once embedded.
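One way to keep peak memory bounded is to embed the corpus in small batches rather than all at once (a sketch reusing the hypothetical `embed` helper above; the batch size is an assumption to tune):

```python
# Batched embedding so only `batch_size` documents are held in
# memory at a time; `embed` is the helper sketched earlier.
import numpy as np

def embed_corpus(texts, batch_size=32):
    chunks = [embed(texts[i:i + batch_size]).cpu().numpy()
              for i in range(0, len(texts), batch_size)]
    return np.concatenate(chunks, axis=0)
```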
Both Principal Component Analysis (PCA) and UMAP are available for reducing the embedding dimensionality, but PCA is strongly recommended as it led to better results.
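A sketch of the two reduction options (the 50-component target is illustrative):

```python
# Dimensionality reduction before clustering; PCA is the recommended
# default, UMAP the alternative.
import numpy as np
from sklearn.decomposition import PCA
# import umap  # pip install umap-learn, for the UMAP option

emb = np.random.default_rng(0).random((500, 512))  # placeholder embeddings
reduced = PCA(n_components=50).fit_transform(emb)
# reduced = umap.UMAP(n_components=50).fit_transform(emb)
```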