Vectorization utilities #76

Merged on Jan 8, 2025 (28 commits)
Commits
28 commits
65e598f
Added dot product as a metric for KeyNMF
x-tabdeveloping Jan 2, 2025
cb49e66
Added Chinese vectorizer
x-tabdeveloping Jan 2, 2025
c0fdbfc
Fixed tokenization
x-tabdeveloping Jan 2, 2025
475a333
Added Jieba as optional dependency
x-tabdeveloping Jan 2, 2025
b69b903
Added Chinese modeling to docs
x-tabdeveloping Jan 2, 2025
946e05a
Added installation disclaimer to docs
x-tabdeveloping Jan 2, 2025
a897594
Version bump
x-tabdeveloping Jan 2, 2025
c599ec8
Updated readme
x-tabdeveloping Jan 2, 2025
af9f2a7
Moved chinese modeling to the Tutorials section and added data to tut…
x-tabdeveloping Jan 3, 2025
7d1eb3a
Relaxed huggingface-hub version
x-tabdeveloping Jan 3, 2025
b386a35
Added tutorial for keyphrase topic modeling
x-tabdeveloping Jan 3, 2025
a3db74b
Created a vectorizers and moved the default, as well as chinese vecto…
x-tabdeveloping Jan 5, 2025
c729de1
Updated docs to reflect changes
x-tabdeveloping Jan 5, 2025
bfb21f5
Added SpaCy-powered Lemma and NounPhrase vectorizers
x-tabdeveloping Jan 6, 2025
8d69868
Added spacy as an optional dependency
x-tabdeveloping Jan 6, 2025
4de6d14
Fixed spacy import error message
x-tabdeveloping Jan 6, 2025
0f127fa
Added snowball stemming vectorizer
x-tabdeveloping Jan 6, 2025
24eaa9e
Added docstring to spacy vectorizers
x-tabdeveloping Jan 6, 2025
3ed07d2
draft: started working on docs
x-tabdeveloping Jan 6, 2025
14443c1
Added vectorizers to docs
x-tabdeveloping Jan 6, 2025
52e78ef
Started restructuring docs
x-tabdeveloping Jan 6, 2025
b22ef1d
Restructured docs in a more sensible way
x-tabdeveloping Jan 7, 2025
308806f
Updated Readme
x-tabdeveloping Jan 7, 2025
b5145bf
reformatted readme example
x-tabdeveloping Jan 7, 2025
8d617de
Added reference to examples chinese page
x-tabdeveloping Jan 7, 2025
e5e6500
Added TokenCountVectorizer
x-tabdeveloping Jan 8, 2025
ad4ecb0
Added TokenCountVectorizer to docs, moved some things into tabs
x-tabdeveloping Jan 8, 2025
c166deb
Updated readme
x-tabdeveloping Jan 8, 2025
50 changes: 24 additions & 26 deletions README.md
@@ -16,48 +16,46 @@
- Streamlined scikit-learn compatible API 🛠️
- Easy topic interpretation 🔍
- Automated topic naming with LLMs
- Topic modeling with keyphrases :key:
- Lemmatization and Stemming
- Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️

> This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening an issue.

### New in version 0.10.0
## New in version 0.11.0: Vectorizers Module

You can interactively explore clusters using `datamapplot` directly in Turftopic!
You will first have to install `datamapplot` for this to work.
You can now use a set of custom vectorizers for topic modeling over **phrases**, as well as **lemmata** and **stems**.

```python
from turftopic import ClusteringTopicModel
from turftopic.namers import OpenAITopicNamer
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer

model = ClusteringTopicModel(feature_importance="centroid")
model = KeyNMF(
    n_components=10,
    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
)
model.fit(corpus)

namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)

fig = model.plot_clusters_datamapplot()
fig.save("clusters_visualization.html")
fig
model.print_topics()
```
> If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.

<figure>
<img src="docs/images/cluster_datamapplot.png" width="70%" style="margin-left: auto;margin-right: auto;">
<figcaption>Interactive figure to explore cluster structure in a clustering topic model.</figcaption>
</figure>

### New in version 0.9.0
| Topic ID | Highest Ranking |
| - | - |
| | ... |
| 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
| 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
| | ... |
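
To make the idea of modeling over stems rather than raw words concrete, here is a minimal standard-library sketch. It is not Turftopic's implementation: `toy_stem`, `SUFFIXES`, and `bag_of_features` are hypothetical names invented for illustration, and the suffix rules are a deliberately crude stand-in for a real stemmer such as Snowball. It only shows how mapping tokens to stems before counting merges related word forms into a single feature.

```python
from collections import Counter

# Hypothetical toy suffix list; a real stemmer (e.g. Snowball) uses far richer rules.
SUFFIXES = ("ism", "s")


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bag_of_features(doc: str, analyzer=None) -> Counter:
    """Count features in a document; `analyzer` maps each token to a feature."""
    tokens = doc.lower().split()
    if analyzer is not None:
        tokens = [analyzer(tok) for tok in tokens]
    return Counter(tokens)


doc = "fanatics denounce fanaticism but every fanatic denounces critics"

word_counts = bag_of_features(doc)                     # one feature per surface form
stem_counts = bag_of_features(doc, analyzer=toy_stem)  # related forms share a feature

print(len(word_counts), "word features;", len(stem_counts), "stem features")
print(stem_counts.most_common(2))
```

With stems as features, the three surface forms of "fanatic" contribute to one vocabulary entry, which is the kind of redundancy visible in the topic descriptions above.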

#### Dynamic S³ 🧭
Turftopic now also comes with a **Chinese vectorizer** for easier use, as well as a generalist **multilingual vectorizer**.

You can now use Semantic Signal Separation in a dynamic fashion.
This allows you to investigate how semantic axes fluctuate over time, and how their content changes.
```python
from turftopic import SemanticSignalSeparation
from turftopic.vectorizers.chinese import default_chinese_vectorizer
from turftopic.vectorizers.spacy import TokenCountVectorizer

model = SemanticSignalSeparation(10).fit_dynamic(corpus, timestamps=ts, bins=10)
chinese_vectorizer = default_chinese_vectorizer()
arabic_vectorizer = TokenCountVectorizer("ar", remove_stopwords=True)
danish_vectorizer = TokenCountVectorizer("da", remove_stopwords=True)
...

model.plot_topics_over_time()
```
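
A language-specific vectorizer matters because whitespace tokenization does not work for every language: Chinese text has no spaces between words. The commits in this PR add Jieba as an optional dependency for this purpose. The sketch below does not use Jieba and is not its algorithm; it is a toy forward maximum matching segmenter over a hypothetical hand-written vocabulary, meant only to show why a segmentation step is needed before counting terms.

```python
def forward_max_match(text: str, vocabulary: set, max_len: int = 4) -> list:
    """Greedy segmentation: at each position take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i : i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary; a real segmenter ships a large dictionary.
vocabulary = {"主题", "模型", "主题模型", "很", "有用"}
text = "主题模型很有用"  # "Topic models are very useful"

whitespace_tokens = text.split()                    # the sentence stays in one piece
segmented = forward_max_match(text, vocabulary)     # split into dictionary words

print(whitespace_tokens)
print(segmented)
```

Splitting on whitespace leaves the whole sentence as a single token, so a count-based vocabulary built that way would contain sentences instead of words.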

