Topic modeling is your turf too.
Contextual topic models with representations from transformers.
This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening an issue.
You can interactively explore clusters using datamapplot directly in Turftopic! You will first have to install datamapplot for this to work.
from turftopic import ClusteringTopicModel
from turftopic.namers import OpenAITopicNamer
model = ClusteringTopicModel(feature_importance="centroid")
model.fit(corpus)
namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)
fig = model.plot_clusters_datamapplot()
fig.save("clusters_visualization.html")
fig
If you are not running Turftopic from a Jupyter notebook, make sure to call fig.show(). This will open up a new browser tab with the interactive figure.
You can now use Semantic Signal Separation in a dynamic fashion. This allows you to investigate how semantic axes fluctuate over time, and how their content changes.
from turftopic import SemanticSignalSeparation
model = SemanticSignalSeparation(10).fit_dynamic(corpus, timestamps=ts, bins=10)
model.plot_topics_over_time()
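To make the bins argument more concrete, here is a minimal sketch of how a list of timestamps could be divided into equal-width time slices. It assumes that ts is a list of datetime objects, one per document, and that bins are of equal width; assign_time_bins is a hypothetical helper written for illustration, and Turftopic's internal binning may differ.

```python
from datetime import datetime


def assign_time_bins(timestamps, n_bins):
    """Hypothetical helper: map each timestamp to an equal-width bin index."""
    start, end = min(timestamps), max(timestamps)
    width = (end - start) / n_bins  # a timedelta covering one bin
    labels = []
    for stamp in timestamps:
        # Dividing two timedeltas yields a float; guard against zero width
        index = int((stamp - start) / width) if width else 0
        # The latest timestamp falls on the right edge, so clip it into the last bin
        labels.append(min(index, n_bins - 1))
    return labels


# One timestamp per document, in the same order as the corpus
example_ts = [
    datetime(2020, 1, 1),
    datetime(2020, 7, 1),
    datetime(2021, 1, 1),
    datetime(2021, 12, 31),
]
example_labels = assign_time_bins(example_ts, n_bins=2)
```

Documents that share a bin index would be treated as belonging to the same time slice when topics are tracked over time.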
Turftopic can be installed from PyPI.
pip install turftopic
If you intend to use CTMs, make sure to install the package with Pyro as an optional dependency.
pip install turftopic[pyro-ppl]
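If you are unsure whether the optional dependencies are available in your environment, a quick check along these lines may help. This is only a sketch using the standard library: missing_optional_dependencies is a hypothetical helper, and the module names assume that pyro-ppl is imported as pyro and datamapplot as datamapplot.

```python
import importlib.util


def missing_optional_dependencies(module_names):
    """Hypothetical helper: return the modules that cannot be found."""
    return [
        name
        for name in module_names
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None
    ]


# Assumed import names of the optional packages mentioned in this README
missing = missing_optional_dependencies(["pyro", "datamapplot"])
if missing:
    print("Missing optional dependencies:", ", ".join(missing))
else:
    print("All optional dependencies are installed.")
```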
Turftopic's models follow the scikit-learn API conventions, and as such they are quite easy to use if you are familiar with scikit-learn workflows.
Here's an example of how to use KeyNMF, one of our models, on the 20 Newsgroups dataset from scikit-learn.
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(
subset="all",
remove=("headers", "footers", "quotes"),
)
corpus = newsgroups.data
Turftopic also comes with interpretation tools that make it easy to display and understand your results.
from turftopic import KeyNMF
model = KeyNMF(20).fit(corpus)
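For readers less familiar with the scikit-learn conventions mentioned above, the toy class below sketches the general contract: fit() learns state and returns the estimator itself, transform() maps documents to a matrix, and fit_transform() does both. ToyKeywordModel is a hypothetical stand-in written for illustration; it is not part of Turftopic and does not reflect how any of its models work internally.

```python
from collections import Counter


class ToyKeywordModel:
    """Hypothetical stand-in that mimics the scikit-learn estimator contract."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, raw_documents, y=None):
        counts = Counter(
            word for doc in raw_documents for word in doc.lower().split()
        )
        # Learned attributes carry a trailing underscore, as in scikit-learn
        self.vocabulary_ = [
            word for word, _ in counts.most_common(self.n_components)
        ]
        # Returning self is what makes Model(...).fit(corpus) chainable
        return self

    def transform(self, raw_documents):
        # One row per document, one column per learned keyword
        return [
            [doc.lower().split().count(word) for word in self.vocabulary_]
            for doc in raw_documents
        ]

    def fit_transform(self, raw_documents, y=None):
        return self.fit(raw_documents).transform(raw_documents)


toy_corpus = ["cats chase mice", "dogs chase cats", "mice fear cats"]
toy_model = ToyKeywordModel(2).fit(toy_corpus)
toy_matrix = toy_model.transform(toy_corpus)
```

The same call pattern, constructing a model, fitting it on a list of strings, then transforming documents, is what the examples in this README rely on.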
Turftopic comes with a number of pretty printing utilities for interpreting the models.
To see the most important words for each topic, use the print_topics() method.
model.print_topics()
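# Topic importances for each document, needed to rank the documents below
document_topic_matrix = model.transform(corpus)
# Print highest ranking documents for topic 0
model.print_representative_documents(0, corpus, document_topic_matrix)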
model.print_topic_distribution(
"I think guns should definitely be banned from all public institutions, such as schools."
)
Turftopic now allows you to automatically assign human-readable names to topics using LLMs or n-gram retrieval!
from turftopic import KeyNMF
from turftopic.namers import OpenAITopicNamer
model = KeyNMF(10).fit(corpus)
namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)
model.print_topics()
Topic ID | Topic Name | Highest Ranking |
---|---|---|
0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
... |
Turftopic does not come with built-in visualization utilities, but topicwizard, an interactive topic model visualization library, is compatible with all models from Turftopic.
pip install topic-wizard
By far the easiest way to visualize your models for interpretation is to launch the topicwizard web app.
import topicwizard
topicwizard.visualize(corpus, model=model)
Alternatively, you can use the Figures API in topicwizard to produce individual HTML figures.