Dimensionality reduction is generally used for the benefit of the cluster model, as most cluster models have difficulty handling high-dimensional data due to the curse of dimensionality. In your first example, you are reducing the sparse matrix to 100 dimensions with TruncatedSVD before UMAP reduces it further to 5 dimensions. This trick is often used to make the procedure more efficient, since sparse matrices can be difficult to compress with UMAP directly: you first reduce the sparse matrix to a dense 100-dimensional matrix, and only then reduce it further to 5 dimensions with UMAP. In your second example, there is no reduction to 5 dimensions, which means the resulting embeddings keep 100 dimensions. That is quite a lot and will typically be difficult for clustering techniques to handle, which is why we generally opt for the former.
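To make that recommended first configuration concrete, a minimal sketch might look as follows (the UMAP parameters are illustrative assumptions, and `docs` stands for your list of documents):

```python
from bertopic import BERTopic
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from umap import UMAP

# Stage 1: sparse tf-idf vectors -> dense 100-dimensional matrix via TruncatedSVD.
# This pipeline serves as the embedding model.
embedding_model = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=100))

# Stage 2: 100 dense dimensions -> 5 dimensions with UMAP before clustering.
umap_model = UMAP(n_components=5, min_dist=0.0, metric="cosine")

topic_model = BERTopic(embedding_model=embedding_model, umap_model=umap_model)
topics, probs = topic_model.fit_transform(docs)  # docs: a list of strings
```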
-
Hi, I am working on clustering content within a document and consciously avoiding the use of embedding models. Instead, I plan to use Singular Value Decomposition (SVD). Previously, I experimented with a combination of TfidfVectorizer and TruncatedSVD. Now, I want to use BERTopic as the backend and try to understand how the various steps of the process are put together.
For example, the pipeline for the Scikit-Learn embeddings is given as:
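That is, roughly (quoting from memory, so the exact numbers may differ):

```python
from bertopic import BERTopic
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tf-idf followed by SVD down to 100 dimensions, used as the embedding model;
# BERTopic's default UMAP then reduces those 100 dimensions further to 5.
pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=100))

topic_model = BERTopic(embedding_model=pipe)
```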
An alternative setup would be using an embedding model (such as TfidfVectorizer) alongside SVD as the umap_model:
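Something like this (again from memory; here the SVD keeps 100 components, so there is no further reduction to 5 dimensions):

```python
from bertopic import BERTopic
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# The vectorizer alone is the embedding model (wrapped in a pipeline so
# BERTopic treats it as a scikit-learn backend); its output stays sparse.
embedding_model = make_pipeline(TfidfVectorizer())

# TruncatedSVD stands in for UMAP; any reducer with .fit and .transform works.
# The clustering step then runs directly on these 100 dimensions.
umap_model = TruncatedSVD(n_components=100)

topic_model = BERTopic(embedding_model=embedding_model, umap_model=umap_model)
```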
But I'm trying to understand why the subsequent steps are necessary and what effects they have. From what I understand, TruncatedSVD already performs dimensionality reduction, so what roles do the separate dimensionality reduction and clustering steps play in this context? TruncatedSVD is also mentioned as the method for dimensionality reduction later on.
Therefore, my question is: What would a pipeline look like if I want to use SVD? Should I use the same vectorizer for both the embeddings and the dimensionality reduction?
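For concreteness, the kind of all-SVD setup I have in mind would be something like this (the component choices here are just my guesses):

```python
from bertopic import BERTopic
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Guess: tf-idf + SVD as the embedding model, a second SVD as the reducer,
# and KMeans as the cluster model (n_clusters picked arbitrarily).
embedding_model = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=100))
umap_model = TruncatedSVD(n_components=5)
hdbscan_model = KMeans(n_clusters=10)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)
```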