zou-group / textgrad

TextGrad: Automatic ''Differentiation'' via Text -- using large language models to backpropagate textual gradients.
http://textgrad.com/
MIT License

Bring your own multimodal model? #94

Open frickp opened 1 month ago

frickp commented 1 month ago

Thanks for this really interesting project; it looks promising, and I'd like to see whether it will work for my multimodal prompt optimization. I see in this code that there is a pre-defined set of models that are allowed to be multimodal, and this blocks my usage. I am using Gemini, but more generally it would be more flexible to let users bring any multimodal model. Do you think it would work to register an engine with something like a `self.multimodal` attribute?
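Roughly what I have in mind (just a sketch; the attribute name and helper function are hypothetical, not part of the current textgrad API):

# Hypothetical sketch, not current textgrad behavior: instead of validating
# against a hard-coded list of multimodal model names, the multimodal code
# path could check an attribute that each engine sets on itself.
def supports_images(engine) -> bool:
    return getattr(engine, "multimodal", False)

class MyGeminiEngine:
    multimodal = True  # the engine declares its own capability

assert supports_images(MyGeminiEngine())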

mertyg commented 1 month ago

Yes, this would be great to have. I am not sure what the best way is to let users bring any multimodal model, though. For text we can just use APIs that are served through OpenAI-compatible clients, like this; but for images I am not sure there is an equally easy way. Do you have any suggestions?
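For text the pattern is roughly the following (any server that speaks the OpenAI chat API works; the base URL and model name below are just placeholders):

from openai import OpenAI

# Any OpenAI-compatible endpoint works for text; base_url and model are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-key")
response = client.chat.completions.create(
    model="my-local-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)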

frickp commented 1 month ago

I was able to do it with Gemini Pro by subclassing the engine and overriding the `generate` method. This is more overhead for the user, but it allows things to work with arbitrary API designs across vendors. I'm not sure whether you plan to loop in the Hugging Face ecosystem of models, but that may be relevant here as well. From what I can see here, if you can get the engine to generate from a list of inputs plus a system prompt, it should work for multimodal optimization.

I can do it with Gemini like this:

from typing import Union

from tenacity import retry, stop_after_attempt, wait_random_exponential
from vertexai.generative_models import GenerativeModel, Part  # Google Vertex AI SDK
from textgrad.engine.base import EngineLM, CachedEngine  # import path may differ by textgrad version

class GeminiMM(EngineLM, CachedEngine):
    def __init__(
        self,
        gemini_model: GenerativeModel,  # a model instance from Google's Vertex AI SDK
        generation_config,  # something Gemini expects, could be refactored
        safety_settings,  # something Gemini expects, could be refactored
        model_string="gemini-vertex",
        system_instruction="You are a financial assistant.",
    ):
        self.model = gemini_model
        self.generation_config = generation_config  # something Gemini expects, could be refactored
        self.safety_settings = safety_settings  # something Gemini expects, could be refactored
        self.model_string = model_string
        self.system_instruction = system_instruction

    @retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
    def __call__(self, prompt, system_prompt, **kwargs):
        return self.generate(prompt, system_prompt, **kwargs)

    def generate(
        self,
        contents: list[Union[str, Part]],  # can mix text and images, in any order
        system_prompt,
    ):
        """Custom code that varies according to what the LLM needs in the `generate` method."""
        self.model._system_instruction = system_prompt
        generation = self.model.generate_content(
            contents,
            generation_config=self.generation_config,  # something Gemini expects, could be refactored
            safety_settings=self.safety_settings,  # something Gemini expects, could be refactored
        )
        return generation.text
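And then I use it roughly like this (the project, location, model name, and file name are just placeholders for my setup):

import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="my-project", location="us-central1")  # placeholders

engine = GeminiMM(
    gemini_model=GenerativeModel("gemini-1.5-pro"),
    generation_config={"temperature": 0.0},
    safety_settings=None,
)

# Mix an image and a text instruction in a single call.
image = Part.from_data(open("chart.png", "rb").read(), mime_type="image/png")
print(engine([image, "Describe this chart."], system_prompt="You are a financial assistant."))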

If the multimodal validation check lived in EngineLM rather than in a pre-defined list of model names, I think I could get around the error.