x-tabdeveloping / turftopic

Robust and fast topic models with sentence-transformers.
https://x-tabdeveloping.github.io/turftopic/
MIT License

Extract topic names with LLMs #44

Open x-tabdeveloping opened 4 weeks ago

x-tabdeveloping commented 4 weeks ago

This is a feature we're still missing that is present in other topic modeling libraries, such as BERTopic or BunkaTopics, and there seems to be industry interest in it.

mjaniec2013 commented 1 week ago

It's relatively simple to implement, but one should not expect miracles (i.e. topic labels more informative than the extracted keywords).

Sample implementation using groq below. The prompt used is a modified combination of prompts from BERTopic and BunkaTopics.


import re

from groq_inference import *  # helper module providing create_chat_message and groq_completion

create_topic_label = (
    "I have a topic that is described by the following keywords:\n",
    "\n<KEYWORDS>\n\n",
    "Based on the keywords about the topic, create a short label (3 words maximum) that best summarizes the topic.\n",
    "The topic may cover various disconnected sub-topics, signaled by the keywords. Create a short topic label encompassing the largest possible coherent theme.\n",
    "Only give the name of the topic and nothing else:"
)

for topic_keywords in topics_keywords_list:

    # Substitute the keyword list for the <KEYWORDS> placeholder in the joined prompt
    create_label = re.sub(pattern='<KEYWORDS>', repl=str(topic_keywords), string="".join(create_topic_label))

    messages = [
        create_chat_message(role='system', content='You are a helpful assistant in Topic Modeling.'),
        create_chat_message(role='user', content=create_label)
    ]

    response = groq_completion(messages=messages, model='llama3-70b-8192', max_tokens=128)

    print(f'\n{response}\n{str(topic_keywords)}\n')

topic_keywords were extracted from the get_topics output.

x-tabdeveloping commented 1 week ago

@mjaniec2013 Thanks for the suggestions! I think there are still more considerations that have to go into how this will be done in practice, but I really appreciate your efforts, especially in the prompting department.

I think that allowing people to use :hugs: Transformers and LangChain is probably the most reasonable choice. I will think a bit about how to include this in the library in a sensible way.

If you are looking to contribute I would really appreciate if you could look into what prompts result in reasonably good interpretations with $S^3$. As you might know, that model interprets topics as semantic axes and the lowest ranking terms should also be used when interpreting them.

x-tabdeveloping commented 4 days ago

So I started working on this on the llm_naming branch, and added base API code, and code for using :hugs: Transformers for naming topics (text2text, chat).

Some of the considerations to take into account when making default choices about prompts and models:

  1. The default option should be a relatively small model (I've been experimenting with stabilityai/stablelm-2-1_6b-chat, stabilityai/stablelm-2-zephyr-1_6b and Google's FLAN-T5 models), so that users can run it on their own machines and don't have to pay for third-party services or acquire compute to run larger models. This is a prompt-engineering challenge, because without robust prompts these small models are not particularly good at this task.
  2. We should not introduce any new hard dependencies to the library, and the optional ones should be kept to a minimum too. I think transformers and langchain are reasonable industry standards and cover 90% of what the average user will want to use.
  3. We should be able to use topic namers with all topic models, including $S^3$, which interprets topics as semantic axes, and axes can have negative descriptive terms (read our paper for more detail)
  4. One should not have to use topic namers, they should be as opt-in as it gets.
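To make point 3 concrete, here is a hypothetical prompt template (not code from the llm_naming branch) for axis-style topics like those of $S^3$, where the lowest-ranking terms of an axis are informative alongside the highest-ranking ones:

```python
# Hypothetical prompt template for semantic-axis topics; the {positive}
# and {negative} placeholders are filled in with str.format().
axis_prompt = (
    "I have a semantic axis from a topic model.\n"
    "Highest-ranking terms: {positive}\n"
    "Lowest-ranking terms: {negative}\n"
    "Give a short name (3 words maximum) for what the axis ranges from "
    "and to, and nothing else."
)

filled = axis_prompt.format(
    positive="memory, recall, hippocampus",
    negative="gpu, cuda, nvidia",
)
print(filled)
```

Whether small models can exploit the negative terms rather than get confused by them is exactly the kind of thing the prompt engineering would need to establish.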

I made the following choices so far:

  1. Namers are a separate class that have a def name_topic(self, positive: list[str], negative: list[str]) -> str method.
  2. Topic models have a name_topics() method that takes a TopicNamer; note that the namer is not an attribute of the topic model, and only interacts with the model when the topics are named.
  3. Topics are only named when the user calls the method, and at no other time, since naming them is a potentially costly and slow operation.
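The shape of that API can be sketched as follows; TopicNamer and the trivial KeywordJoinNamer stand-in here are illustrative only, the actual classes live on the llm_naming branch:

```python
from abc import ABC, abstractmethod


class TopicNamer(ABC):
    """Sketch of the base namer interface described above."""

    @abstractmethod
    def name_topic(self, positive: list[str], negative: list[str]) -> str:
        """Produce a name from a topic's top (and, for axis models,
        bottom) ranking terms."""
        ...


class KeywordJoinNamer(TopicNamer):
    """Trivial stand-in that joins the top keywords, useful for testing
    the plumbing without calling an LLM."""

    def name_topic(self, positive: list[str], negative: list[str]) -> str:
        return " / ".join(positive[:3])


namer = KeywordJoinNamer()
print(namer.name_topic(["llama", "meta_ai", "ai"], []))  # llama / meta_ai / ai
```

An LLM-backed namer would implement the same single method, so models never need to know which backend (Transformers, LangChain, an API) is doing the naming.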

Here's some example code of how it works so far:

from turftopic import KeyNMF
from turftopic.namers.hf_transformers import (ChatTopicNamer,
                                              Text2TextTopicNamer)

prompt = """
I have a topic, which can be described with these keywords: {positive}.
What is the topic about? Respond with a short name only for the topic and nothing else.
"""

system_prompt = """
You are a topic namer. When the user gives you a set of keywords, you respond with a name for the topic they describe.
You only respond briefly with the name of the topic, and nothing else.
"""

model = KeyNMF(10)
model.fit(corpus)

namer = ChatTopicNamer("stabilityai/stablelm-2-1_6b-chat", prompt_template=prompt, system_prompt=system_prompt)
topic_names = model.name_topics(namer)
model.print_topics()

Note that prompt templates are filled in using Python's built-in str.format() method, so placeholder names should be put between curly braces in the prompt.
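A standalone illustration of that substitution mechanism (not turftopic internals), using the same prompt text as the example above:

```python
# str.format() substitutes keyword arguments into the curly-brace
# placeholders of a template string.
template = "I have a topic, which can be described with these keywords: {positive}."
filled = template.format(positive=", ".join(["llama", "meta_ai", "generative_ai"]))
print(filled)
```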

If you want to spend some time engineering prompts @mjaniec2013, I encourage you to do so in the example code above. I have so far failed to get a prompt that produces satisfactory results, maybe you have more experience or better luck :D

mjaniec2013 commented 3 days ago

I'd gladly assist in engineering the keywords-to-topic-label prompt.

Below is an example of the output of the prompt proposed earlier, using the groq LLaMA 3 70B model:

Meta AI Assistants ['llama', 'meta_ai', 'ai', 'meta_llama', 'llms', 'generative_ai', 'assistant', 'language']

Human Memory Function ['memory', 'memories', 'term_memory', 'recall', 'forgetting', 'brain', 'term_memories', 'hippocampus', 'remember', 'psychology']

Contextual AI Memory ['term_memory', 'memory', 'conversation', 'ai', 'context', 'language', 'text', 'dialogues']

Conversational AI Systems ['conversationchain', 'conversational_memory', 'conversationbuffermemory', 'langchain', 'ai_talkative', 'chatbots', 'conversation_human']

Language Model AI ['large_language', 'language_models', 'language_modeling', 'llms', 'natural_language', 'models_llms', 'learning', 'ai']

AI Industry Tech ['artificial_intelligence', 'ai', 'intelligence', 'automation', 'autonomous', 'technology', 'industry', 'generative_ai', 'nlp', 'companies']

Alzheimer's Brain Disease ['alzheimer', 'dementia', 'brain', 'hippocampus', 'memory', 'cognitive', 'disease', 'term_memory']

AI Computing Hardware ['gpus', 'generative_ai', 'nvidia', 'computing', 'ai', 'cpu', 'workloads', 'optimizations', 'processors']

Investing and Finance ['investments', 'stocks', 'investing', 'companies', 'investors', 'life_insurance', 'finance', 'diversify', 'financial']

Memory and Cognition ['episodic_memory', 'memory', 'semantic_memory', 'term_memory', 'cognitive', 'sensory_memory', 'remembering', 'psychology', 'episodic_semantic', 'stimuli']

Brain Memory Processes ['memory_consolidation', 'hippocampal', 'hippocampus', 'prefrontal_cortex', 'brain', 'neurobiological', 'neuropsychologia', 'cognition', 'neuropsychology', 'neurobiology']

AI Language Models ['openai', 'ai', 'chatbot', 'google', 'chatgpt', 'language', 'llama', 'datasets']

Chatbot Conversation Model ['chat_history', 'langchain', 'conversationbuffermemory', 'chat_messages', 'chat_memory', 'chatbot', 'humanmessage', 'chatmodels']

Format:

  1. LLM generated label
  2. topic keywords passed to LLM (up to 10 per topic, with some inflection-based reduction).