Restructured docs in a more sensible way

x-tabdeveloping · Jan 7, 2025 · b22ef1d · b22ef1d
1 parent 52e78ef
commit b22ef1d
Showing 1 changed file with 108 additions and 75 deletions.
diff --git a/docs/model_definition_and_training.md b/docs/model_definition_and_training.md
@@ -1,125 +1,155 @@
 # Defining and Training Topic Models
 
-To get started using Turftopic you will need to load and fit a topic model.
-This page provides more in-depth information on how to do these.
+In order to start modeling your corpora, you will need to define a topic model.
+There are a wide array of available models in Turftopic that all have their unique behaviour.
+On the other hand all models will need to have certain components, and have attributes you can adjust to your needs.
+This page provides a guide on how to define models, train them, and use them for inference.
+
+<figure>
+  <img src="../images/topic_modeling_pipeline.png" width="800px" style="margin-left: auto;margin-right: auto;">
+  <figcaption>Components of a Topic Modeling Pipeline</figcaption>
+</figure>
+
+
+## Defining a Model
+
+### 1. [Topic Model](../models.md)
+ In order to initialize a model, you will first need to make a choice about which **topic model** you'd like to use.
+You might want to have a look at the [Models](models.md) page in order to make an informed choice about the topic model you intend to train.
+
+Here are some examples of models you can load and use in the package:
+
+<table>
+<tr>
+<td> Model </td> <td> Example Definition </td>
+</tr>
+<tr>
+<td>
+
+<a href="https://x-tabdeveloping.github.io/turftopic/KeyNMF/">KeyNMF</a>
+
+</td>
+<td>
 
 ```python
 from turftopic import KeyNMF
 
-model = KeyNMF(20).fit(corpus)
+model = KeyNMF(n_components=10, top_n=15)
 ```
 
-## Important Attributes
+</td>
+</tr>
+<tr>
+<td>
 
-In Turftopic all models have a vectorizer and an encoder component, which you can specify when initializing a model.
+<a href="https://x-tabdeveloping.github.io/turftopic/clustering/">ClusteringTopicModel</a>
 
-1. The __vectorizer__ is used to turn documents into Bag-of-Words representations and learning the vocabulary. The default used in the package is sklearn's `CountVectorizer`.
-1. The __encoder__ is used to encode documents, and optionally the vocabulary into contextual representations. This will most frequently be a Sentence Transformer. The default in Turftopic is `all-MiniLM-L6-v2`, a very lightweight English model.
+</td>
+<td>
 
-You can use any of the built-in encoders in Turftopic to encode your documents, or any sentence transformer from the HuggingFace Hub.
-This allows you to use embeddings of different quality and computational efficiency for different purposes.
+```python
+from turftopic import ClusteringTopicModel
 
-Here's a model that uses E5 large as the embedding model, and only learns words that occur in at least 20 documents.
+model = ClusteringTopicModel(n_reduce_to=10, feature_importance="centroid")
+```
+
+</td>
+</tr>
+<tr>
+<td>
+
+<a href="https://x-tabdeveloping.github.io/turftopic/s3/">SemanticSignalSeparation</a>
+
+</td>
+<td>
 
 ```python
 from turftopic import SemanticSignalSeparation
-from sklearn.feature_extraction.text import CountVectorizer
 
-model = SemanticSignalSeparation(10, encoder="all-MiniLM-L6-v2", vectorizer=CountVectorizer(min_df=20))
+model = SemanticSignalSeparation(n_components=10, feature_importance="combined")
 ```
 
-You can also use external models for encoding, here's an example with [OpenAI's embedding models](encoders.md#external_embeddings):
+</td>
+</tr>
+</table>
 
-```python
-from turftopic import GMM
-from turftopic.encoders import OpenAIEmbeddings
+### 2. [Vectorizer](../vectorizers.md)
 
-model = GMM(10, encoder=OpenAIEmbeddings("text-embedding-3-large"))
-```
+In Turftopic, all Models have a vectorizer component, which is responsible for extracting word content from documents in the corpus.
+This means, that a vectorizer also determines which words will be part of the model's vocabulary.
+For a more detailed explanation, see the [Vectorizers](../vectorizers.md) page
 
-If you intend to, you can also use n-grams as features instead of words:
+The default is scikit-learn's CountVectorizer:
 
 ```python
-from turftopic import GMM
 from sklearn.feature_extraction.text import CountVectorizer
 
-model = GMM(10, vectorizer=CountVectorizer(ngram_range=(2,4)))
+default_vectorizer = CountVectorizer(min_df=10, stop_words="english")
 ```
 
-## Fitting Models
-
-All models in Turftopic have a `fit()` method, that takes a textual corpus in the form of an iterable of strings.
+You can add a custom vectorizer to a topic model upon initializing it,
+thereby getting different behaviours. You can for instance use noun-phrases in your model instead of words by using NounPhraseCountVectorizer:
 
-Beware that the iterable has to be reusable, as models have to do multiple passes over the corpus.
+```bash
+pip install turftopic[spacy]
+python -m spacy download "en_core_web_sm"
+```
 
 ```python
-corpus: list[str] = ["this is a a document", "this is yet another document", ...]
+from turftopic import KeyNMF
+from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
 
-model.fit(corpus)
+model = KeyNMF(10, vectorizer=NounPhraseCountVectorizer())
 ```
 
-## Prompting Embedding Models
+### 3. [Encoder](../encoders.md)
 
-Some embedding models can be used together with prompting, or encode queries and passages differently.
-This can significantly influence performance, especially in the case of models that are based on retrieval ([KeyNMF](KeyNMF.md)) or clustering ([ClusteringTopicModel](clustering.md)).
-Microsoft's E5 models are, for instance all prompted by default, and it would be detrimental to performance not to do so yourself.
+Since all models in Turftopic rely on contextual embeddings, you will need to specify a contextual embedding model to use.
+The default is [`all-MiniLM-L6-v2`](sentence-transformers/all-MiniLM-L6-v2), which is a very fast and reasonably performant embedding model for English.
+You might, however want to use custom embeddings, either because your corpus is not in English, or because you need higher speed or performance.
+See a detailed guide on Encoders [here](../encoders.md).
 
-In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using `sentence-transformers`.
+Similar to a vectorizer, you can add an encoder to a topic model upon initializing it.
 
-Here's an example for clustering models:
 ```python
-from turftopic import ClusteringTopicModel
+from turftopic import KeyNMF
 from sentence_transformers import SentenceTransformer
 
-encoder = SentenceTransformer(
-    "intfloat/multilingual-e5-large-instruct",
-    prompts={
-        "query": "Instruct: Cluster documents according to the topic they are about. Query: "
-        "passage": "Passage: "
-    },
-    # Make sure to set default prompt to query!
-    default_prompt_name="query",
-)
-model = ClusteringTopicModel(encoder=encoder)
+encoder = SentenceTransformer("parahprase-multilingual-MiniLM-L12-v2")
+model = KeyNMF(10, encoder=encoder)
 ```
 
-You can also use instruct models for keyword retrieval with KeyNMF.
-In this case, documents will serve as the queries and words as the passages:
+### 4. [Namer](../namers.md) (*optional*)
+
+A Namer is an optional part of your topic modeling pipeline, that can automatically assign human-readable names to topics.
+Namers are technically **not part of your topic model**, and should be used *after training*.
+See a detailed guide [here](../namers.md).
 
 ```python
 from turftopic import KeyNMF
-from sentence_transformers import SentenceTransformer
+from turftopic.namers import LLMTopicNamer
 
-encoder = SentenceTransformer(
-    "intfloat/multilingual-e5-large-instruct",
-    prompts={
-        "query": "Instruct: Retrieve relevant keywords from the given document. Query: "
-        "passage": "Passage: "
-    },
-    # Make sure to set default prompt to query!
-    default_prompt_name="query",
-)
-model = KeyNMF(10, encoder=encoder)
+model = KeyNMF(10).fit(corpus)
+namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
+
+model.rename_topics(namer)
 ```
 
-When using KeyNMF with E5, make sure to specify the prompts even if you're not using instruct models:
+## Training and Inference
+
+### Model Training
+
+All models in Turftopic have a `fit()` method, that takes a textual corpus in the form of an iterable of strings.
+
+Beware that the iterable has to be reusable, as models have to do multiple passes over the corpus.
 
 ```python
-encoder = SentenceTransformer(
-    "intfloat/e5-large-v2",
-    prompts={
-        "query": "query: "
-        "passage": "passage: "
-    },
-    # Make sure to set default prompt to query!
-    default_prompt_name="query",
-)
-model = KeyNMF(10, encoder=encoder)
-```
+corpus: list[str] = ["this is a a document", "this is yet another document", ...]
 
-Setting the default prompt to `query` is especially important, when you are precomputing embeddings, as `query` should always be your default prompt to embed documents with.
+model.fit(corpus)
+```
 
-## Precomputing Embeddings
+### Precomputing Embeddings
 
 In order to cut down on costs/computational load when fitting multiple models in a row, you might want to encode the documents before fitting a model.
 Encoding the corpus is the heaviest part of the process and you can spare yourself a lot of time by only doing it once.
@@ -147,10 +177,16 @@ gmm = GMM(10, encoder=encoder).fit(corpus, embeddings=embeddings)
 clustering = ClusteringTopicModel(encoder=encoder).fit(corpus, embeddings=embeddings)
 ```
 
-## Inference
+### Inference
 
+Some models in Turftopic are capable of estimating topic importance scores for documents in your corpus.
 In order to get the importance of each topic for the documents in the corpus, you might want to use `fit_transform()` instead of `fit()`
 
+!!! warning
+    Note that using `fit()` and `transform()` in succession is not the same as using `fit_transform()` and the later should be preferred under all circumstances.
+    For one, not all models have a `transform()` method, but `fit_transform()` is also way more efficient, as documents don't have to be encoded twice.
+    Some models have additional optimizations going on when using `fit_transform()`, and the `fit()` method typically uses `fit_transform()` in the background.
+
 ```python
 document_topic_matrix = model.fit_transform(corpus)
 ```
@@ -163,9 +199,6 @@ You can infer topical content for new documents with a fitted model using the `t
 document_topic_matrix = model.transform(new_documents, embeddings=None)
 ```
 
- > Note that using `fit()` and `transform()` in succession is not the same as using `fit_transform()` and the later should be preferred under all circumstances.
- > For one, not all models have a `transform()` method, but `fit_transform()` is also way more efficient, as documents don't have to be encoded twice.
- > Some models have additional optimizations going on when using `fit_transform()`, and the `fit()` method typically uses `fit_transform()` in the background.