Support the gemini-pro-vision multimodal model

readthecodes commented 11 months ago

zhu327 commented 11 months ago

This feature is not supported at the moment.

duolabmeng6 commented 11 months ago

Hope to support gemini-pro-vision

zhu327 commented 11 months ago

Hope to support gemini-pro-vision

This feature is not a necessity for me, so it may have to wait until someone submits a pull request for it.

duolabmeng6 commented 11 months ago

Hello

Thank you for your prompt reply. I understand that the feature might not be a priority for you at the moment. However, I'd like to share my perspective on the potential value it could bring to the project, especially in relation to supporting "gemini-pro-vision."

In order to move this forward, I am willing to contribute code to implement this feature. I believe it not only aligns with my needs but could also benefit other users of the project. Any guidance or suggestions you can provide on how to proceed with this would be highly appreciated.

Thank you for your time and consideration.

zhu327 commented 11 months ago

@duolabmeng6 You are welcome to submit a pull request! I have carefully examined the differences between the OpenAI API and the Gemini Pro Vision API, and I have some suggestions for your implementation:

On the OpenAI side, only gpt-4-vision-preview currently supports multimodal input. I therefore recommend creating a model map: when a request names gpt-4-vision-preview, look up the Google model name gemini-pro-vision in the map; for any other model name, fall back to gemini-pro.
The content field in a gpt-4-vision-preview request is structured differently from the existing request format (it can be a list of typed parts rather than a plain string). Consider how to handle this dynamic structure.
In gpt-4-vision-preview, images can be supplied as URLs or as base64-encoded data. However, the SDK only accepts raw bytes for gemini-pro-vision. Consider how to handle URLs and whether the images need to be downloaded locally first.

All of the above is described in the OpenAI documentation: OpenAI Vision API Guide. Thank you for your contribution!
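
The three suggestions above could be sketched roughly as follows. This is only an illustration in Python, not the proxy's actual code: `MODEL_MAP`, `to_gemini_model`, `decode_data_url`, and `normalize_content` are hypothetical names, and remote image URLs are returned unfetched so the caller can decide whether to download them.

```python
import base64

# Hypothetical map from OpenAI model names to Gemini model names.
MODEL_MAP = {"gpt-4-vision-preview": "gemini-pro-vision"}
DEFAULT_MODEL = "gemini-pro"


def to_gemini_model(openai_model):
    """Return the mapped Gemini model, falling back to gemini-pro."""
    return MODEL_MAP.get(openai_model, DEFAULT_MODEL)


def decode_data_url(url):
    """Split a base64 data URL into (mime_type, raw_bytes)."""
    header, _, payload = url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


def normalize_content(content):
    """Flatten OpenAI message content (a string or a list of typed parts).

    Each returned part is ("text", str), ("image_bytes", (mime, bytes)),
    or ("image_url", str) for a remote URL that still has to be fetched.
    """
    if isinstance(content, str):
        return [("text", content)]
    parts = []
    for item in content:
        if item["type"] == "text":
            parts.append(("text", item["text"]))
        elif item["type"] == "image_url":
            url = item["image_url"]["url"]
            if url.startswith("data:"):
                parts.append(("image_bytes", decode_data_url(url)))
            else:
                parts.append(("image_url", url))
        else:
            raise ValueError("unsupported content type: " + str(item["type"]))
    return parts
```

Keeping remote URLs as a separate part type leaves the "fetch locally or not" decision in one place instead of spreading it through the request parser.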

zhu327 commented 11 months ago

In the documentation for Gemini Pro Vision, I noticed a note stating that the vision model does not have robust support for multi-turn context. Therefore, it seems unnecessary to implement conversation context for it.

https://ai.google.dev/tutorials/go_quickstart#multi-turn-conversations-chat
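
One possible way to act on this, shown only as a sketch: when the request maps to gemini-pro-vision, forward just the most recent user message and drop earlier history. The function name `select_messages` and this trimming policy are assumptions for illustration, not something the proxy is confirmed to do.

```python
def select_messages(messages, gemini_model):
    """Pick which OpenAI-style messages to forward to Gemini.

    Assumption: the vision model gets only the latest user message,
    while every other model keeps the full conversation history.
    """
    if gemini_model != "gemini-pro-vision":
        return list(messages)
    for message in reversed(messages):
        if message.get("role") == "user":
            return [message]
    return []
```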

zhu327 commented 11 months ago

curl http://localhost:8080/v1/chat/completions \
 -H "Content-Type: application/json" \
 -H "Authorization: Bearer $YOUR_GOOGLE_AI_STUDIO_API_KEY" \
 -d '{
     "model": "gpt-4-vision-preview",
     "messages": [{"role": "user", "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
          }
        }
     ]}],
     "temperature": 0.7
 }'

Gemini Pro Vision is now supported. Have fun 😊
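
For anyone calling the proxy from code rather than curl, the same request can be built with Python's standard library. This is a minimal sketch: `build_vision_request` is a hypothetical helper, and the base URL and API key are whatever your own setup uses.

```python
import json
import urllib.request


def build_vision_request(base_url, api_key, image_url, prompt="What's in this image?"):
    """Build (but do not send) the chat completions request from the curl example."""
    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it; the response body is JSON in the OpenAI chat completions format.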

duolabmeng6 commented 11 months ago

I'm already using it. Thank you very much.

zhu327 / gemini-openai-proxy

Support the gemini-pro-vision multimodal model #5