This pull request introduces support for the Llama 3.2-Vision collection of multimodal large language models (LLMs) in Xinference. These models can process both text and image inputs, broadening the range of applications Xinference can serve.
Key Changes:
- Expanded model support: adds the Llama 3.2-Vision and Llama 3.2-Vision-Instruct models to the list of supported models, accessible through both the transformers and vllm engines (see the launch sketch after this list).
- vllm engine enhancement: updates the vllm engine to accommodate the specific requirements of the Llama 3.2-Vision models.
- Documentation updates: extends the documentation with details about the newly supported models to guide users in using them effectively.
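For illustration, here is a minimal sketch of how launching one of these models could look from the Python client once this change lands. The model name `llama-3.2-vision-instruct` and the endpoint URL are assumptions for this example; the exact registered name comes from the model definitions added in this PR.

```python
# Minimal sketch: launching a Llama 3.2-Vision model via the Xinference client.
# Assumptions: a local Xinference server on the default port, and the
# hypothetical model name "llama-3.2-vision-instruct".
from xinference.client import Client

client = Client("http://localhost:9997")  # default Xinference endpoint

model_uid = client.launch_model(
    model_name="llama-3.2-vision-instruct",  # hypothetical registered name
    model_engine="vllm",                     # or "transformers", per this PR
)
print(f"Launched model with uid: {model_uid}")
```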
Implementation details:
- Updated llm_family.json and llm_family_modelscope.json to include Llama 3.2-Vision and Llama 3.2-Vision-Instruct model information.
- Modified the vllm engine's core.py to handle these models.
- Updated the model reference files in the documentation to reflect the newly supported built-in models.
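Since Xinference exposes an OpenAI-compatible API, a vision model launched this way should also be reachable with a standard text-plus-image chat request. The sketch below assumes the same hypothetical model name and a placeholder image URL:

```python
# Hedged sketch: sending a mixed text + image request through Xinference's
# OpenAI-compatible endpoint. Model name and image URL are placeholders.
import openai

client = openai.OpenAI(base_url="http://localhost:9997/v1", api_key="not-used")

response = client.chat.completions.create(
    model="llama-3.2-vision-instruct",  # hypothetical registered name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The message format follows the OpenAI vision convention of mixed `text` and `image_url` content parts, which is what multimodal chat endpoints generally expect.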