Dimensionality reduction is generally used for the benefit of the cluster model, as most cluster models have difficulty handling high-dimensional data due to the curse of dimensionality. In your first example, you are reducing the sparse matrix to 100 dimensions with TruncatedSVD before UMAP reduces it further to 5 dimensions. This trick is often used to make the procedure more efficient, since sparse matrices can be difficult to compress with UMAP directly: you first reduce the sparse matrix to a dense 100-dimensional matrix, and only then reduce it further to 5 dimensions with UMAP. In your second example, there is no reduction to 5 dimensions, which means the resulting embeddings keep 100 dimensions. That is quite a lot and will typically be difficult for clustering techniques to handle, which is why we generally opt for the former.
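To make that recommended first configuration concrete, a minimal sketch might look as follows (the UMAP parameters are illustrative assumptions, and `docs` stands for your list of documents):

```python
from bertopic import BERTopic
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from umap import UMAP

# Stage 1: sparse tf-idf vectors -> dense 100-dimensional matrix via TruncatedSVD.
# This pipeline serves as the embedding model.
embedding_model = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=100))

# Stage 2: 100 dense dimensions -> 5 dimensions with UMAP before clustering.
umap_model = UMAP(n_components=5, min_dist=0.0, metric="cosine")

topic_model = BERTopic(embedding_model=embedding_model, umap_model=umap_model)
topics, probs = topic_model.fit_transform(docs)  # docs: a list of strings
```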
-
Hi, I am working on clustering content within a document and consciously avoiding the use of embedding models. Instead, I plan to use Singular Value Decomposition (SVD). Previously, I experimented with a combination of TfidfVectorizer and TruncatedSVD. Now, I want to use BERTopic as the backend and try to understand how the various steps of the process are put together.
For example, the pipeline for the Scikit-Learn embeddings is given as:
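That is, roughly (quoting from memory, so the exact numbers may differ):

```python
from bertopic import BERTopic
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tf-idf followed by SVD down to 100 dimensions, used as the embedding model;
# BERTopic's default UMAP then reduces those 100 dimensions further to 5.
pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=100))

topic_model = BERTopic(embedding_model=pipe)
```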
An alternative setup would be using an embedding model (such as TfidfVectorizer) alongside SVD as the umap_model:
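Something like this (again from memory; here the SVD keeps 100 components, so there is no further reduction to 5 dimensions):

```python
from bertopic import BERTopic
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# The vectorizer alone is the embedding model (wrapped in a pipeline so
# BERTopic treats it as a scikit-learn backend); its output stays sparse.
embedding_model = make_pipeline(TfidfVectorizer())

# TruncatedSVD stands in for UMAP; any reducer with .fit and .transform works.
# The clustering step then runs directly on these 100 dimensions.
umap_model = TruncatedSVD(n_components=100)

topic_model = BERTopic(embedding_model=embedding_model, umap_model=umap_model)
```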
But I'm trying to understand why the subsequent steps are necessary and what effects they have. From what I understand, TruncatedSVD already performs dimensionality reduction, so what roles do the separate dimensionality reduction and clustering steps play in this context? TruncatedSVD is also mentioned as the method for dimensionality reduction later on.
Therefore, my question is: What would a pipeline look like if I want to use SVD? Should I use the same vectorizer for both the embeddings and the dimensionality reduction?
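For concreteness, the kind of all-SVD setup I have in mind would be something like this (the component choices here are just my guesses):

```python
from bertopic import BERTopic
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Guess: tf-idf + SVD as the embedding model, a second SVD as the reducer,
# and KMeans as the cluster model (n_clusters picked arbitrarily).
embedding_model = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=100))
umap_model = TruncatedSVD(n_components=5)
hdbscan_model = KMeans(n_clusters=10)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)
```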