Closed readthecodes closed 11 months ago
This feature is not supported at the moment.
Hope to support gemini-pro-vision
This feature is not a necessity for me; it might have to wait until someone submits a pull request for it.
Hello
Thank you for your prompt reply. I understand that the feature might not be a priority for you at the moment. However, I'd like to share my perspective on the potential value it could bring to the project, especially in relation to supporting "gemini-pro-vision."
In order to move this forward, I am willing to contribute code to implement this feature. I believe it not only aligns with my needs but could also benefit other users of the project. Any guidance or suggestions you can provide on how to proceed with this would be highly appreciated.
Thank you for your time and consideration.
@duolabmeng6 You are welcome to submit a Pull Request! I have carefully examined the differences between the OpenAI API and the Gemini Pro Vision API, and I have some suggestions for your implementation:
1. Currently, only gpt-4-vision-preview supports multimodal capabilities. Therefore, we recommend creating a model map. When the user submits a request with the model name gpt-4-vision-preview, retrieve the Google model name gemini-pro-vision from the map. For other model names provided by the user, use gemini-pro as the default.
2. The content structure in the request data submitted for gpt-4-vision-preview is different from the existing request structure. Consider how to handle this dynamic structure.
3. In gpt-4-vision-preview, images can be given as URLs or as base64-encoded data. However, gemini-pro-vision in the SDK only supports bytes. Consider how to handle URLs, and whether it is necessary to fetch them locally.
All of the above is covered in the OpenAI documentation: OpenAI Vision API Guide. Thank you for your contribution!
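For readers following along, the three suggestions above could be sketched roughly as follows. This is only an illustration in Python, not the project's actual code: the names MODEL_MAP, to_gemini_model, parse_content, TextPart, ImagePart and RemoteImage are all hypothetical, and the sketch assumes the OpenAI-style content format in which content is either a plain string or a list of text and image_url items. Remote URLs are only marked for later fetching, since whether to download them locally is left open in the discussion.

```python
import base64
from dataclasses import dataclass
from typing import Any, Union

# Hypothetical map from OpenAI model names to Gemini model names.
MODEL_MAP = {"gpt-4-vision-preview": "gemini-pro-vision"}
DEFAULT_MODEL = "gemini-pro"


def to_gemini_model(openai_model: str) -> str:
    """Return the mapped Gemini model, falling back to the default."""
    return MODEL_MAP.get(openai_model, DEFAULT_MODEL)


@dataclass
class TextPart:
    text: str


@dataclass
class ImagePart:
    mime_type: str
    data: bytes  # raw bytes, the form a bytes-only SDK would need


@dataclass
class RemoteImage:
    url: str  # still has to be fetched before it can be passed as bytes


Part = Union[TextPart, ImagePart, RemoteImage]


def parse_image_url(url: str) -> Union[ImagePart, RemoteImage]:
    """Decode a base64 data URL to bytes; leave other URLs to be fetched."""
    if url.startswith("data:"):
        header, sep, payload = url.partition(",")
        if not sep or not header.endswith(";base64"):
            raise ValueError("only base64 data URLs are handled in this sketch")
        mime_type = header[len("data:"):-len(";base64")]
        return ImagePart(mime_type, base64.b64decode(payload))
    return RemoteImage(url)


def parse_content(content: Any) -> list[Part]:
    """Normalize 'content' that is either a string or a list of typed items."""
    if isinstance(content, str):
        return [TextPart(content)]
    parts: list[Part] = []
    for item in content:
        if item["type"] == "text":
            parts.append(TextPart(item["text"]))
        elif item["type"] == "image_url":
            parts.append(parse_image_url(item["image_url"]["url"]))
        else:
            raise ValueError(f"unsupported content type: {item['type']!r}")
    return parts
```

Keeping remote URLs as a separate RemoteImage type makes the open question from point 3 explicit: the caller decides whether and how to download them before handing bytes to the SDK.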
In the Gemini Pro Vision documentation, I noticed a note stating that the model does not have robust support for multi-turn context. Therefore, it seems unnecessary to implement it.
https://ai.google.dev/tutorials/go_quickstart#multi-turn-conversations-chat
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_GOOGLE_AI_STUDIO_API_KEY" \
  -d '{
    "model": "gpt-4-vision-preview",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "What’s in this image?"},
      {
        "type": "image_url",
        "image_url": {
          "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
        }
      }
    ]}],
    "temperature": 0.7
  }'
Gemini Pro Vision is now supported, have fun 😊
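The curl request above passes the image as a remote URL. Since the OpenAI format also allows base64-encoded images, a local image could in principle be sent as a data URL instead. The sketch below only builds the JSON body and does not send it; build_vision_request is a hypothetical helper, and whether this proxy accepts data URLs is an assumption that should be checked against the project itself.

```python
import base64
import json


def build_vision_request(prompt: str, image_bytes: bytes,
                         mime_type: str = "image/jpeg") -> str:
    """Build an OpenAI-style chat request body with an inline base64 image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]}],
        "temperature": 0.7,
    }
    return json.dumps(payload)
```

The returned string can be used as the body of the same POST request to /v1/chat/completions, for example by saving it to a file and passing it to curl with -d @file.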
I have already tried it. Thank you very much.
Support the gemini-pro-vision multimodal model