xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Error when generating images with FLUX.1-schnell #2213

Closed cnzayn closed 1 week ago

cnzayn commented 2 weeks ago

System Info / 系統信息

NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4

Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?

Version info / 版本信息

v0.14.3

The command used to start Xinference / 用以启动 xinference 的命令

Docker startup command: docker run -itd --name xinference15 -v /data/model_zoo:/model_zoo -v /data/xinference/:/xinference/ -e XINFERENCE_MODEL_SRC=modelscope -e XINFERENCE_HOME=/xinference -p 9989:9997 --gpus all registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v0.14.3 xinference-local -H 0.0.0.0

Command to launch FLUX.1-schnell: xinference launch --model-path /xinference/modelscope/hub/AI-ModelScope/FLUX___1-schnell --model-name FLUX.1-schnell --model-type image --quantize_text_encoder text_encoder_2

Reproduction / 复现过程

  1. Launched FLUX.1-schnell with the command above; the launch succeeded.
  2. Requested an image over HTTP (v1/images/generations); the request failed with the following error (see the sketch after this list for the request): RuntimeError: Failed to create the images, detail: [address=0.0.0.0:41935, pid=1324] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
  3. FLUX.1-dev has the same problem.
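For reference, a minimal sketch of the HTTP request made in step 2. The payload is the one given later in this thread ({ "model": "FLUX.1-schnell", "prompt": "an apple" }); the host port 9989 comes from the docker run mapping above (use 9997 from inside the container).

```python
# Hypothetical client-side sketch of the v1/images/generations request; not
# part of xinference itself.
import requests

resp = requests.post(
    "http://localhost:9989/v1/images/generations",  # -p 9989:9997 per the docker command above
    json={"model": "FLUX.1-schnell", "prompt": "an apple"},
)
print(resp.status_code)
print(resp.json())
```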

Expected behavior / 期待表现

Images can be generated normally with FLUX.1-schnell.

cnzayn commented 2 weeks ago
  1. After adding the --cpu_offload True parameter to the xinference launch command, the error is: RuntimeError: Failed to launch model, detail: [address=0.0.0.0:43747, pid=2099] It seems like you have activated a device mapping strategy on the pipeline so calling `enable_model_cpu_offload()` isn't allowed. You can call `reset_device_map()` first and then call `enable_model_cpu_offload()`.
  2. After modifying the load function in xinference/model/image/stable_diffusion/core.py to call reset_device_map() before enable_model_cpu_offload() (see the sketch after this list), the error is: RuntimeError: Failed to launch model, detail: [address=0.0.0.0:36197, pid=5097] `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype.
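A minimal sketch of the change described in step 2, written against a plain diffusers FluxPipeline; the actual attribute names and surrounding logic in core.py's load() differ, so this is only illustrative.

```python
# Hedged sketch: call reset_device_map() before enable_model_cpu_offload(), as
# the first error message suggests. In the reported setup text_encoder_2 is
# quantized with bitsandbytes (--quantize_text_encoder), and offloading then
# fails because it moves modules with .to(), which bitsandbytes-quantized
# models do not support -- hence the second error above.
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "/xinference/modelscope/hub/AI-ModelScope/FLUX___1-schnell"
)
pipe.reset_device_map()           # drop the device-mapping strategy first
pipe.enable_model_cpu_offload()   # then CPU offload is allowed again
```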
qinxuye commented 2 weeks ago

Check the diffusers version.

cnzayn commented 1 week ago

@qinxuye The diffusers version is 0.30.1.

qinxuye commented 1 week ago

What error do you get when using quantize?

cnzayn commented 1 week ago

The quantize step itself does not error. Here is the log from launching with xinference launch --model-path /xinference/modelscope/hub/AI-ModelScope/FLUX___1-schnell --model-name FLUX.1-schnell --model-type image --quantize_text_encoder text_encoder_2:

2024-09-03 03:51:51,676 xinference.model.utils 92 INFO Use model cache from a different hub. /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_fwd") /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_bwd") 2024-09-03 03:51:53,375 transformers.configuration_utils 6227 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder_2/config.json 2024-09-03 03:51:53,376 transformers.configuration_utils 6227 INFO Model config T5Config { "_name_or_path": "google/t5-v1_1-xxl", "architectures": [ "T5EncoderModel" ], "classifier_dropout": 0.0, "d_ff": 10240, "d_kv": 64, "d_model": 4096, "decoder_start_token_id": 0, "dense_act_fn": "gelu_new", "dropout_rate": 0.1, "eos_token_id": 1, "feed_forward_proj": "gated-gelu", "initializer_factor": 1.0, "is_encoder_decoder": true, "is_gated_act": true, "layer_norm_epsilon": 1e-06, "model_type": "t5", "num_decoder_layers": 24, "num_heads": 64, "num_layers": 24, "output_past": true, "pad_token_id": 0, "relative_attention_max_distance": 128, "relative_attention_num_buckets": 32, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "use_cache": true, "vocab_size": 32128 }

2024-09-03 03:51:53,376 transformers.quantizers.quantizer_bnb_8bit 6227 INFO Overriding torch_dtype=None with torch_dtype=torch.float16 due to requirements of bitsandbytes to enable model loading in 8-bit or 4-bit. Pass your own torch_dtype to specify the dtype of the remaining non-linear layers or pass torch_dtype=torch.float16 to remove this warning. 2024-09-03 03:51:53,377 transformers.quantizers.quantizer_bnb_8bit 6227 INFO The device_map was not initialized. Setting device_map to {'':torch.cuda.current_device()}. If you want to use the model for inference, please set device_map ='auto' 2024-09-03 03:51:53,377 transformers.modeling_utils 6227 WARNING low_cpu_mem_usage was None, now set to True since model is quantized. 2024-09-03 03:51:53,377 transformers.modeling_utils 6227 INFO loading weights file /xinference/cache/FLUX.1-schnell/text_encoder_2/model.safetensors.index.json 2024-09-03 03:51:53,377 transformers.modeling_utils 6227 INFO Instantiating T5EncoderModel model under default dtype torch.float16. Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.61s/it] 2024-09-03 03:51:58,788 transformers.modeling_utils 6227 INFO All model checkpoint weights were used when initializing T5EncoderModel.

2024-09-03 03:51:58,789 transformers.modeling_utils 6227 INFO All the weights of T5EncoderModel were initialized from the model checkpoint at /xinference/cache/FLUX.1-schnell. If your task is similar to the task the model of the checkpoint was trained on, you can already use T5EncoderModel for predictions without further training. Keyword arguments {'lora_model_paths': None, 'model-path': '/xinference/modelscope/hub/AI-ModelScope/FLUX___1-schnell'} are not expected by FluxPipeline and will be ignored. 2024-09-03 03:51:58,957 transformers.configuration_utils 6227 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder/config.json 2024-09-03 03:51:58,958 transformers.configuration_utils 6227 INFO Model config CLIPTextConfig { "_name_or_path": "openai/clip-vit-large-patch14", "architectures": [ "CLIPTextModel" ], "attention_dropout": 0.0, "bos_token_id": 0, "dropout": 0.0, "eos_token_id": 2, "hidden_act": "quick_gelu", "hidden_size": 768, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-05, "max_position_embeddings": 77, "model_type": "clip_text_model", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 1, "projection_dim": 768, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "vocab_size": 49408 }

Loading pipeline components...: 14%|████████████████████▎ | 1/7 [00:00<00:02, 2.04it/s]2024-09-03 03:51:59,728 transformers.tokenization_utils_base 6227 INFO loading file spiece.model 2024-09-03 03:51:59,728 transformers.tokenization_utils_base 6227 INFO loading file tokenizer.json 2024-09-03 03:51:59,728 transformers.tokenization_utils_base 6227 INFO loading file added_tokens.json 2024-09-03 03:51:59,728 transformers.tokenization_utils_base 6227 INFO loading file special_tokens_map.json 2024-09-03 03:51:59,728 transformers.tokenization_utils_base 6227 INFO loading file tokenizer_config.json 2024-09-03 03:51:59,729 transformers.models.t5.tokenization_t5_fast 6227 WARNING You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers Loading pipeline components...: 29%|████████████████████████████████████████▌ | 2/7 [00:00<00:01, 2.96it/s]2024-09-03 03:51:59,959 transformers.tokenization_utils_base 6227 INFO loading file vocab.json 2024-09-03 03:51:59,959 transformers.tokenization_utils_base 6227 INFO loading file merges.txt 2024-09-03 03:51:59,959 transformers.tokenization_utils_base 6227 INFO loading file added_tokens.json 2024-09-03 03:51:59,959 transformers.tokenization_utils_base 6227 INFO loading file special_tokens_map.json 2024-09-03 03:51:59,959 transformers.tokenization_utils_base 6227 INFO loading file tokenizer_config.json 2024-09-03 03:51:59,959 transformers.tokenization_utils_base 6227 INFO loading file tokenizer.json 2024-09-03 03:51:59,960 transformers.models.clip.tokenization_clip 6227 INFO ftfy or spacy is not installed using custom BasicTokenizer instead of ftfy. 2024-09-03 03:52:00,030 transformers.configuration_utils 6227 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder/config.json 2024-09-03 03:52:00,031 transformers.configuration_utils 6227 INFO Model config CLIPTextConfig { "_name_or_path": "openai/clip-vit-large-patch14", "architectures": [ "CLIPTextModel" ], "attention_dropout": 0.0, "bos_token_id": 0, "dropout": 0.0, "eos_token_id": 2, "hidden_act": "quick_gelu", "hidden_size": 768, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-05, "max_position_embeddings": 77, "model_type": "clip_text_model", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 1, "projection_dim": 768, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "vocab_size": 49408 }

2024-09-03 03:52:00,032 transformers.modeling_utils 6227 INFO loading weights file /xinference/cache/FLUX.1-schnell/text_encoder/model.safetensors 2024-09-03 03:52:00,040 transformers.modeling_utils 6227 INFO Instantiating CLIPTextModel model under default dtype torch.float16. 2024-09-03 03:52:00,480 transformers.modeling_utils 6227 INFO All model checkpoint weights were used when initializing CLIPTextModel.

2024-09-03 03:52:00,480 transformers.modeling_utils 6227 INFO All the weights of CLIPTextModel were initialized from the model checkpoint at /xinference/cache/FLUX.1-schnell/text_encoder. If your task is similar to the task the model of the checkpoint was trained on, you can already use CLIPTextModel for predictions without further training.

qinxuye commented 1 week ago

So does it run normally with quantize?

cnzayn commented 1 week ago

Log from calling the v1/images/generations endpoint to generate an image:

2024-09-03 03:58:42,855 xinference.api.restful_api 1 ERROR [address=0.0.0.0:33681, pid=6227] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm) Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1416, in create_images image_list = await model.text_to_image( File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send return self._process_result_message(result) File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message raise message.as_instanceof_cause() File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 656, in send result = await self._run_coro(message.message_id, coro) File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 367, in _run_coro return await coro File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in on_receive return await super().on_receive(message) # type: ignore File "xoscar/core.pyx", line 558, in on_receive__ raise ex File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive async with self._lock: File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive with debug_async_timeout('actor_lock_timeout', File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive result = await result File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 45, in wrapped ret = await func(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 90, in wrapped_func ret = await fn(self, *args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 698, in text_to_image return await self._call_wrapper_json( File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 401, in _call_wrapper_json return await self._call_wrapper("json", fn, args, kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 114, in _async_wrapper return await fn(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 418, in _call_wrapper ret = await asyncio.to_thread(fn, *args, *kwargs) File "/usr/lib/python3.10/asyncio/threads.py", line 25, in to_thread return await loop.run_in_executor(None, func_call) File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run result = self.fn(self.args, self.kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/model/image/stable_diffusion/core.py", line 222, in text_to_image return self._call_model( File "/usr/local/lib/python3.10/dist-packages/xinference/model/image/stable_diffusion/core.py", line 175, in _call_model images = model(kwargs).images File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/flux/pipeline_flux.py", line 696, in call noise_pred = self.transformer( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_flux.py", line 366, in forward hidden_states 
= self.x_embedder(hidden_states) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 117, in forward return F.linear(input, self.weight, self.bias) RuntimeError: [address=0.0.0.0:33681, pid=6227] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm) Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/gradio/queueing.py", line 527, in process_events response = await route_utils.call_process_api( File "/usr/local/lib/python3.10/dist-packages/gradio/route_utils.py", line 261, in call_process_api output = await app.get_blocks().process_api( File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1786, in process_api result = await self.call_function( File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1338, in call_function prediction = await anyio.to_thread.run_sync( File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 56, in run_sync return await get_async_backend().run_sync_in_worker_thread( File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread return await future File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 859, in run result = context.run(func, args) File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 759, in wrapper response = f(args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/core/image_interface.py", line 96, in text_generate_image response = model.text_to_image( File "/usr/local/lib/python3.10/dist-packages/xinference/client/restful/restful_client.py", line 227, in text_to_image raise RuntimeError( RuntimeError: Failed to create the images, detail: [address=0.0.0.0:33681, pid=6227] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)

qinxuye commented 1 week ago

This looks like the error you get with cpu_offload enabled. Is there an error when using only quantize?

cnzayn commented 1 week ago

cpu_offload is not enabled. With only quantize, loading the model does not error; the error occurs when generating an image. If cpu_offload is enabled (passing --cpu_offload True at launch time), launching the model itself fails; that error log is posted above.

qinxuye commented 1 week ago

But your most recent error looks like one that only appears with cpu_offload enabled; please double-check.

cnzayn commented 1 week ago

After adding the --cpu_offload True parameter to the xinference launch command, the error is: RuntimeError: Failed to launch model, detail: [address=0.0.0.0:43747, pid=2099] It seems like you have activated a device mapping strategy on the pipeline so calling `enable_model_cpu_offload()` isn't allowed. You can call `reset_device_map()` first and then call `enable_model_cpu_offload()`. After modifying the load function in xinference/model/image/stable_diffusion/core.py to call reset_device_map() before enable_model_cpu_offload(), the error is: RuntimeError: Failed to launch model, detail: [address=0.0.0.0:36197, pid=5097] `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype.

cnzayn commented 1 week ago

Where does it show that cpu_offload is enabled?

qinxuye commented 1 week ago

> Log from calling the v1/images/generations endpoint to generate an image:
>
> 2024-09-03 03:58:42,855 xinference.api.restful_api 1 ERROR [address=0.0.0.0:33681, pid=6227] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm) ... (same traceback as posted above)

This one.

qinxuye commented 1 week ago

Let's go one step at a time: restart xinference, load the model with quantize, then generate an image and see whether there is any problem. It works fine when I try it.
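For what it's worth, the same sequence can be scripted with the Python client. This sketch assumes the in-container endpoint on port 9997 and that launch_model forwards model_path and quantize_text_encoder as keyword arguments, the way the CLI flags do:

```python
# Hedged sketch: launch FLUX.1-schnell with only the text encoder quantized
# (no cpu_offload), then generate an image.
from xinference.client import Client

client = Client("http://localhost:9997")
model_uid = client.launch_model(
    model_name="FLUX.1-schnell",
    model_type="image",
    model_path="/xinference/modelscope/hub/AI-ModelScope/FLUX___1-schnell",  # assumed kwarg, mirrors --model-path
    quantize_text_encoder="text_encoder_2",                                  # assumed kwarg, mirrors --quantize_text_encoder
)
model = client.get_model(model_uid)
result = model.text_to_image(prompt="an apple")
print(result)
```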

cnzayn commented 1 week ago

Let me describe the startup sequence again:

  1. I deploy with docker; the image version is v0.14.3 (registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference). After starting the container I enter it and run the xinference commands inside.
  2. The model was downloaded with xinference launch --model-name FLUX.1-schnell --model-type image; after the download I located the model and copied its path.
  3. The model is loaded with xinference launch --model-path /xinference/modelscope/hub/AI-ModelScope/FLUX___1-schnell --model-name FLUX.1-schnell --model-type image --quantize_text_encoder text_encoder_2.
  4. Images are requested through the v1/images/generations endpoint with the payload { "model": "FLUX.1-schnell", "prompt": "an apple"}.
  5. Loading the model does not error; the loading log is posted above.
  6. Generating the image errors: text_to_image raise RuntimeError( RuntimeError: Failed to create the images, detail: [address=0.0.0.0:33681, pid=6227] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
cnzayn commented 1 week ago

Server driver and CUDA version: Driver Version: 550.54.14, CUDA Version: 12.4

qinxuye commented 1 week ago

How much GPU memory is there?

cnzayn commented 1 week ago

(screenshot attachment showing the GPU memory)

qinxuye commented 1 week ago

Let's see the complete log, from model loading through inference.

cnzayn commented 1 week ago

2024-09-03 05:39:58,476 xinference.model.utils 92 INFO Use model cache from a different hub. /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_fwd") /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_bwd") 2024-09-03 05:40:00,139 transformers.configuration_utils 6475 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder_2/config.json 2024-09-03 05:40:00,140 transformers.configuration_utils 6475 INFO Model config T5Config { "_name_or_path": "google/t5-v1_1-xxl", "architectures": [ "T5EncoderModel" ], "classifier_dropout": 0.0, "d_ff": 10240, "d_kv": 64, "d_model": 4096, "decoder_start_token_id": 0, "dense_act_fn": "gelu_new", "dropout_rate": 0.1, "eos_token_id": 1, "feed_forward_proj": "gated-gelu", "initializer_factor": 1.0, "is_encoder_decoder": true, "is_gated_act": true, "layer_norm_epsilon": 1e-06, "model_type": "t5", "num_decoder_layers": 24, "num_heads": 64, "num_layers": 24, "output_past": true, "pad_token_id": 0, "relative_attention_max_distance": 128, "relative_attention_num_buckets": 32, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "use_cache": true, "vocab_size": 32128 }

2024-09-03 05:40:00,141 transformers.quantizers.quantizer_bnb_8bit 6475 INFO Overriding torch_dtype=None with torch_dtype=torch.float16 due to requirements of bitsandbytes to enable model loading in 8-bit or 4-bit. Pass your own torch_dtype to specify the dtype of the remaining non-linear layers or pass torch_dtype=torch.float16 to remove this warning. 2024-09-03 05:40:00,142 transformers.quantizers.quantizer_bnb_8bit 6475 INFO The device_map was not initialized. Setting device_map to {'':torch.cuda.current_device()}. If you want to use the model for inference, please set device_map ='auto' 2024-09-03 05:40:00,142 transformers.modeling_utils 6475 WARNING low_cpu_mem_usage was None, now set to True since model is quantized. 2024-09-03 05:40:00,142 transformers.modeling_utils 6475 INFO loading weights file /xinference/cache/FLUX.1-schnell/text_encoder_2/model.safetensors.index.json 2024-09-03 05:40:00,142 transformers.modeling_utils 6475 INFO Instantiating T5EncoderModel model under default dtype torch.float16. Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.84s/it] 2024-09-03 05:40:06,001 transformers.modeling_utils 6475 INFO All model checkpoint weights were used when initializing T5EncoderModel.

2024-09-03 05:40:06,002 transformers.modeling_utils 6475 INFO All the weights of T5EncoderModel were initialized from the model checkpoint at /xinference/cache/FLUX.1-schnell. If your task is similar to the task the model of the checkpoint was trained on, you can already use T5EncoderModel for predictions without further training. Keyword arguments {'lora_model_paths': None, 'model-path': '/xinference/modelscope/hub/AI-ModelScope/FLUX___1-schnell'} are not expected by FluxPipeline and will be ignored. 2024-09-03 05:40:06,169 transformers.configuration_utils 6475 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder/config.json 2024-09-03 05:40:06,170 transformers.configuration_utils 6475 INFO Model config CLIPTextConfig { "_name_or_path": "openai/clip-vit-large-patch14", "architectures": [ "CLIPTextModel" ], "attention_dropout": 0.0, "bos_token_id": 0, "dropout": 0.0, "eos_token_id": 2, "hidden_act": "quick_gelu", "hidden_size": 768, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-05, "max_position_embeddings": 77, "model_type": "clip_text_model", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 1, "projection_dim": 768, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "vocab_size": 49408 }

Loading pipeline components...: 14%|████████████████████▎ | 1/7 [00:00<00:02, 2.05it/s]2024-09-03 05:40:06,939 transformers.tokenization_utils_base 6475 INFO loading file spiece.model 2024-09-03 05:40:06,940 transformers.tokenization_utils_base 6475 INFO loading file tokenizer.json 2024-09-03 05:40:06,940 transformers.tokenization_utils_base 6475 INFO loading file added_tokens.json 2024-09-03 05:40:06,940 transformers.tokenization_utils_base 6475 INFO loading file special_tokens_map.json 2024-09-03 05:40:06,940 transformers.tokenization_utils_base 6475 INFO loading file tokenizer_config.json 2024-09-03 05:40:06,941 transformers.models.t5.tokenization_t5_fast 6475 WARNING You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers Loading pipeline components...: 29%|████████████████████████████████████████▌ | 2/7 [00:00<00:01, 2.97it/s]2024-09-03 05:40:07,171 transformers.tokenization_utils_base 6475 INFO loading file vocab.json 2024-09-03 05:40:07,171 transformers.tokenization_utils_base 6475 INFO loading file merges.txt 2024-09-03 05:40:07,171 transformers.tokenization_utils_base 6475 INFO loading file added_tokens.json 2024-09-03 05:40:07,171 transformers.tokenization_utils_base 6475 INFO loading file special_tokens_map.json 2024-09-03 05:40:07,171 transformers.tokenization_utils_base 6475 INFO loading file tokenizer_config.json 2024-09-03 05:40:07,171 transformers.tokenization_utils_base 6475 INFO loading file tokenizer.json 2024-09-03 05:40:07,172 transformers.models.clip.tokenization_clip 6475 INFO ftfy or spacy is not installed using custom BasicTokenizer instead of ftfy. 2024-09-03 05:40:07,241 transformers.configuration_utils 6475 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder/config.json 2024-09-03 05:40:07,243 transformers.configuration_utils 6475 INFO Model config CLIPTextConfig { "_name_or_path": "openai/clip-vit-large-patch14", "architectures": [ "CLIPTextModel" ], "attention_dropout": 0.0, "bos_token_id": 0, "dropout": 0.0, "eos_token_id": 2, "hidden_act": "quick_gelu", "hidden_size": 768, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-05, "max_position_embeddings": 77, "model_type": "clip_text_model", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 1, "projection_dim": 768, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "vocab_size": 49408 }

2024-09-03 05:40:07,243 transformers.modeling_utils 6475 INFO loading weights file /xinference/cache/FLUX.1-schnell/text_encoder/model.safetensors 2024-09-03 05:40:07,251 transformers.modeling_utils 6475 INFO Instantiating CLIPTextModel model under default dtype torch.float16. 2024-09-03 05:40:07,666 transformers.modeling_utils 6475 INFO All model checkpoint weights were used when initializing CLIPTextModel.

2024-09-03 05:40:07,666 transformers.modeling_utils 6475 INFO All the weights of CLIPTextModel were initialized from the model checkpoint at /xinference/cache/FLUX.1-schnell/text_encoder. If your task is similar to the task the model of the checkpoint was trained on, you can already use CLIPTextModel for predictions without further training. Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:04<00:00, 1.43it/s] 0%| | 0/28 [00:00<?, ?it/s] 2024-09-03 05:40:36,300 xinference.api.restful_api 1 ERROR [address=0.0.0.0:34505, pid=6475] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm) Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1416, in create_images image_list = await model.text_to_image( File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send return self._process_result_message(result) File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message raise message.as_instanceof_cause() File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 656, in send result = await self._run_coro(message.message_id, coro) File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 367, in _run_coro return await coro File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in on_receive return await super().on_receive(message) # type: ignore File "xoscar/core.pyx", line 558, in on_receive__ raise ex File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive async with self._lock: File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive with debug_async_timeout('actor_lock_timeout', File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive result = await result File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 45, in wrapped ret = await func(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 90, in wrapped_func ret = await fn(self, *args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 698, in text_to_image return await self._call_wrapper_json( File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 401, in _call_wrapper_json return await self._call_wrapper("json", fn, args, kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 114, in _async_wrapper return await fn(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 418, in _call_wrapper ret = await asyncio.to_thread(fn, *args, *kwargs) File "/usr/lib/python3.10/asyncio/threads.py", line 25, in to_thread return await loop.run_in_executor(None, func_call) File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run result = self.fn(self.args, self.kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/model/image/stable_diffusion/core.py", line 222, in text_to_image return self._call_model( File "/usr/local/lib/python3.10/dist-packages/xinference/model/image/stable_diffusion/core.py", line 175, in _call_model images = model(kwargs).images File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context 
return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/flux/pipeline_flux.py", line 696, in call noise_pred = self.transformer( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_flux.py", line 366, in forward hidden_states = self.x_embedder(hidden_states) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 117, in forward return F.linear(input, self.weight, self.bias) RuntimeError: [address=0.0.0.0:34505, pid=6475] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)

qinxuye commented 1 week ago

Oh, debug is not enabled. Try again with xinference-local xxx --log-level debug.

cnzayn commented 1 week ago

Do I append the --log-level debug parameter to the model-loading command, or run the whole xinference-local xxx --log-level debug command after the model has been loaded?

qinxuye commented 1 week ago

> Do I append the --log-level debug parameter to the model-loading command, or run the whole xinference-local xxx --log-level debug command after the model has been loaded?

It is specified when starting the service.

cnzayn commented 1 week ago

root@b714ced99782:/opt/inference# xinference-local --log-level debug 2024-09-03 06:57:46,071 xinference.core.supervisor 3054 INFO Xinference supervisor 127.0.0.1:58162 started 2024-09-03 06:57:46,261 xinference.core.worker 3054 INFO Starting metrics export server at 127.0.0.1:None 2024-09-03 06:57:46,265 xinference.core.worker 3054 INFO Checking metrics export server... 2024-09-03 06:57:49,263 xinference.core.worker 3054 INFO Metrics server is started at: http://127.0.0.1:32883 2024-09-03 06:57:49,264 xinference.core.worker 3054 INFO Purge cache directory: /xinference/cache 2024-09-03 06:57:49,265 xinference.core.supervisor 3054 DEBUG Enter add_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f54c4f8af70>, '127.0.0.1:58162'), kwargs: {} 2024-09-03 06:57:49,265 xinference.core.supervisor 3054 DEBUG Worker 127.0.0.1:58162 has been added successfully 2024-09-03 06:57:49,265 xinference.core.supervisor 3054 DEBUG Leave add_worker, elapsed time: 0 s 2024-09-03 06:57:49,265 xinference.core.worker 3054 INFO Connected to supervisor as a fresh worker 2024-09-03 06:57:49,279 xinference.core.worker 3054 INFO Xinference worker 127.0.0.1:58162 started 2024-09-03 06:57:49,283 xinference.core.supervisor 3054 DEBUG Worker 127.0.0.1:58162 resources: {'cpu': ResourceStatus(usage=0.0, total=40, memory_used=3405799424, memory_available=196492734464, memory_total=201392889856), 'gpu-0': GPUStatus(mem_total=11811160064, mem_free=11535974400, mem_used=275185664), 'gpu-1': GPUStatus(mem_total=11811160064, mem_free=11535974400, mem_used=275185664), 'gpu-2': GPUStatus(mem_total=11811160064, mem_free=11535974400, mem_used=275185664), 'gpu-3': GPUStatus(mem_total=11811160064, mem_free=11535974400, mem_used=275185664), 'gpu-4': GPUStatus(mem_total=11811160064, mem_free=11535974400, mem_used=275185664), 'gpu-5': GPUStatus(mem_total=11811160064, mem_free=11535974400, mem_used=275185664), 'gpu-6': GPUStatus(mem_total=11811160064, mem_free=11535974400, mem_used=275185664), 'gpu-7': GPUStatus(mem_total=11811160064, mem_free=11535974400, mem_used=275185664)} 2024-09-03 06:57:51,059 xinference.core.supervisor 3054 DEBUG Enter get_status, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f54c4f8af70>,), kwargs: {} 2024-09-03 06:57:51,059 xinference.core.supervisor 3054 DEBUG Leave get_status, elapsed time: 0 s 2024-09-03 06:57:52,637 xinference.api.restful_api 2968 INFO Starting Xinference at endpoint: http://127.0.0.1:9997 2024-09-03 06:57:52,853 xinference.api.restful_api 2968 WARNING Failed to create socket with port 9997 2024-09-03 06:57:52,861 xinference.api.restful_api 2968 INFO Found available port: 17087 2024-09-03 06:57:52,861 xinference.api.restful_api 2968 INFO Starting Xinference at endpoint: http://127.0.0.1:17087 2024-09-03 06:57:53,047 uvicorn.error 2968 INFO Uvicorn running on http://127.0.0.1:17087 (Press CTRL+C to quit) ^C2024-09-03 06:59:40,951 xinference.core.supervisor 3054 DEBUG Enter remove_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f54c4f8af70>, '127.0.0.1:58162'), kwargs: {} 2024-09-03 06:59:40,951 xinference.core.supervisor 3054 DEBUG Worker 127.0.0.1:58162 has been removed successfully 2024-09-03 06:59:40,952 xinference.core.supervisor 3054 DEBUG Leave remove_worker, elapsed time: 0 s

qinxuye commented 1 week ago

Also the logs for loading the model and running inference, with debug enabled.

cnzayn commented 1 week ago

I enabled debug mode, but the loading and inference logs are no different from before.

qinxuye commented 1 week ago

There will be some extra debug messages; those are mainly what I want to see.

cnzayn commented 1 week ago

2024-09-03 07:09:19,288 uvicorn.access 1 INFO 127.0.0.1:45716 - "GET /v1/cluster/auth HTTP/1.1" 200 2024-09-03 07:09:27,159 xinference.model.utils 92 INFO Use model cache from a different hub. 2024-09-03 07:09:28,821 transformers.configuration_utils 3539 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder_2/config.json 2024-09-03 07:09:28,822 transformers.configuration_utils 3539 INFO Model config T5Config { "_name_or_path": "google/t5-v1_1-xxl", "architectures": [ "T5EncoderModel" ], "classifier_dropout": 0.0, "d_ff": 10240, "d_kv": 64, "d_model": 4096, "decoder_start_token_id": 0, "dense_act_fn": "gelu_new", "dropout_rate": 0.1, "eos_token_id": 1, "feed_forward_proj": "gated-gelu", "initializer_factor": 1.0, "is_encoder_decoder": true, "is_gated_act": true, "layer_norm_epsilon": 1e-06, "model_type": "t5", "num_decoder_layers": 24, "num_heads": 64, "num_layers": 24, "output_past": true, "pad_token_id": 0, "relative_attention_max_distance": 128, "relative_attention_num_buckets": 32, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "use_cache": true, "vocab_size": 32128 }

2024-09-03 07:09:28,823 transformers.quantizers.quantizer_bnb_8bit 3539 INFO Overriding torch_dtype=None with torch_dtype=torch.float16 due to requirements of bitsandbytes to enable model loading in 8-bit or 4-bit. Pass your own torch_dtype to specify the dtype of the remaining non-linear layers or pass torch_dtype=torch.float16 to remove this warning. 2024-09-03 07:09:28,823 transformers.quantizers.quantizer_bnb_8bit 3539 INFO The device_map was not initialized. Setting device_map to {'':torch.cuda.current_device()}. If you want to use the model for inference, please set device_map ='auto' 2024-09-03 07:09:28,824 transformers.modeling_utils 3539 WARNING low_cpu_mem_usage was None, now set to True since model is quantized. 2024-09-03 07:09:28,824 transformers.modeling_utils 3539 INFO loading weights file /xinference/cache/FLUX.1-schnell/text_encoder_2/model.safetensors.index.json 2024-09-03 07:09:28,824 transformers.modeling_utils 3539 INFO Instantiating T5EncoderModel model under default dtype torch.float16. 2024-09-03 07:09:34,687 transformers.modeling_utils 3539 INFO All model checkpoint weights were used when initializing T5EncoderModel.

2024-09-03 07:09:34,688 transformers.modeling_utils 3539 INFO All the weights of T5EncoderModel were initialized from the model checkpoint at /xinference/cache/FLUX.1-schnell. If your task is similar to the task the model of the checkpoint was trained on, you can already use T5EncoderModel for predictions without further training. 2024-09-03 07:09:34,771 transformers.configuration_utils 3539 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder/config.json 2024-09-03 07:09:34,772 transformers.configuration_utils 3539 INFO Model config CLIPTextConfig { "_name_or_path": "openai/clip-vit-large-patch14", "architectures": [ "CLIPTextModel" ], "attention_dropout": 0.0, "bos_token_id": 0, "dropout": 0.0, "eos_token_id": 2, "hidden_act": "quick_gelu", "hidden_size": 768, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-05, "max_position_embeddings": 77, "model_type": "clip_text_model", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 1, "projection_dim": 768, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "vocab_size": 49408 }

2024-09-03 07:09:35,157 transformers.configuration_utils 3539 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder/config.json 2024-09-03 07:09:35,158 transformers.configuration_utils 3539 INFO Model config CLIPTextConfig { "_name_or_path": "openai/clip-vit-large-patch14", "architectures": [ "CLIPTextModel" ], "attention_dropout": 0.0, "bos_token_id": 0, "dropout": 0.0, "eos_token_id": 2, "hidden_act": "quick_gelu", "hidden_size": 768, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-05, "max_position_embeddings": 77, "model_type": "clip_text_model", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 1, "projection_dim": 768, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "vocab_size": 49408 }

2024-09-03 07:09:35,158 transformers.modeling_utils 3539 INFO loading weights file /xinference/cache/FLUX.1-schnell/text_encoder/model.safetensors 2024-09-03 07:09:35,165 transformers.modeling_utils 3539 INFO Instantiating CLIPTextModel model under default dtype torch.float16. 2024-09-03 07:09:35,584 transformers.modeling_utils 3539 INFO All model checkpoint weights were used when initializing CLIPTextModel.

2024-09-03 07:09:35,585 transformers.modeling_utils 3539 INFO All the weights of CLIPTextModel were initialized from the model checkpoint at /xinference/cache/FLUX.1-schnell/text_encoder. If your task is similar to the task the model of the checkpoint was trained on, you can already use CLIPTextModel for predictions without further training. 2024-09-03 07:09:35,620 transformers.tokenization_utils_base 3539 INFO loading file spiece.model 2024-09-03 07:09:35,621 transformers.tokenization_utils_base 3539 INFO loading file tokenizer.json 2024-09-03 07:09:35,621 transformers.tokenization_utils_base 3539 INFO loading file added_tokens.json 2024-09-03 07:09:35,621 transformers.tokenization_utils_base 3539 INFO loading file special_tokens_map.json 2024-09-03 07:09:35,621 transformers.tokenization_utils_base 3539 INFO loading file tokenizer_config.json 2024-09-03 07:09:35,622 transformers.models.t5.tokenization_t5_fast 3539 WARNING You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers 2024-09-03 07:09:39,643 transformers.tokenization_utils_base 3539 INFO loading file vocab.json 2024-09-03 07:09:39,643 transformers.tokenization_utils_base 3539 INFO loading file merges.txt 2024-09-03 07:09:39,643 transformers.tokenization_utils_base 3539 INFO loading file added_tokens.json 2024-09-03 07:09:39,643 transformers.tokenization_utils_base 3539 INFO loading file special_tokens_map.json 2024-09-03 07:09:39,643 transformers.tokenization_utils_base 3539 INFO loading file tokenizer_config.json 2024-09-03 07:09:39,643 transformers.tokenization_utils_base 3539 INFO loading file tokenizer.json 2024-09-03 07:09:39,644 transformers.models.clip.tokenization_clip 3539 INFO ftfy or spacy is not installed using custom BasicTokenizer instead of ftfy. 2024-09-03 07:09:40,172 uvicorn.access 1 INFO 127.0.0.1:45718 - "POST /v1/models HTTP/1.1" 200

2024-09-03 07:09:54,464 xinference.api.restful_api 1 ERROR [address=0.0.0.0:39495, pid=3539] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm) Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1416, in create_images image_list = await model.text_to_image( File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send return self._process_result_message(result) File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message raise message.as_instanceof_cause() File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 656, in send result = await self._run_coro(message.message_id, coro) File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 367, in _run_coro return await coro File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in on_receive return await super().on_receive(message) # type: ignore File "xoscar/core.pyx", line 558, in on_receive__ raise ex File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive async with self._lock: File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive with debug_async_timeout('actor_lock_timeout', File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive result = await result File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 45, in wrapped ret = await func(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 90, in wrapped_func ret = await fn(self, *args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 698, in text_to_image return await self._call_wrapper_json( File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 401, in _call_wrapper_json return await self._call_wrapper("json", fn, args, kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 114, in _async_wrapper return await fn(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 418, in _call_wrapper ret = await asyncio.to_thread(fn, *args, *kwargs) File "/usr/lib/python3.10/asyncio/threads.py", line 25, in to_thread return await loop.run_in_executor(None, func_call) File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run result = self.fn(self.args, self.kwargs) File "/usr/local/lib/python3.10/dist-packages/xinference/model/image/stable_diffusion/core.py", line 222, in text_to_image return self._call_model( File "/usr/local/lib/python3.10/dist-packages/xinference/model/image/stable_diffusion/core.py", line 175, in _call_model images = model(kwargs).images File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/flux/pipeline_flux.py", line 696, in call noise_pred = self.transformer( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_flux.py", line 366, in forward hidden_states 
= self.x_embedder(hidden_states) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 117, in forward return F.linear(input, self.weight, self.bias) RuntimeError: [address=0.0.0.0:39495, pid=3539] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm) 2024-09-03 07:09:54,468 uvicorn.access 1 INFO 10.90.152.64:63708 - "POST /v1/images/generations HTTP/1.1" 400

qinxuye commented 1 week ago
(screenshot)

Not a single debug line is visible here…

qinxuye commented 1 week ago

pip show xinference

Find the code path, then comment out this line and see whether it runs:

https://github.com/xorbitsai/inference/blob/865c17496e97c2c2676b583d418b347dfcde9d9d/xinference/model/image/stable_diffusion/core.py#L154

cnzayn commented 1 week ago

I also find it strange. After enabling debug with xinference-local --log-level debug, only the current terminal shows some debug output; there are no debug entries in the model-loading log. I deliberately tried several times, and there really are none.

cnzayn commented 1 week ago

(screenshot)

cnzayn commented 1 week ago

After commenting it out, the error is the same. I changed the log statement in your screenshot from debug to info and will print it again.

cnzayn commented 1 week ago

2024-09-03 07:51:16,223 uvicorn.access 1 INFO 127.0.0.1:54862 - "GET /v1/cluster/auth HTTP/1.1" 200 2024-09-03 07:51:24,203 xinference.model.utils 92 INFO Use model cache from a different hub. 2024-09-03 07:51:25,874 transformers.configuration_utils 749 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder_2/config.json 2024-09-03 07:51:25,875 transformers.configuration_utils 749 INFO Model config T5Config { "_name_or_path": "google/t5-v1_1-xxl", "architectures": [ "T5EncoderModel" ], "classifier_dropout": 0.0, "d_ff": 10240, "d_kv": 64, "d_model": 4096, "decoder_start_token_id": 0, "dense_act_fn": "gelu_new", "dropout_rate": 0.1, "eos_token_id": 1, "feed_forward_proj": "gated-gelu", "initializer_factor": 1.0, "is_encoder_decoder": true, "is_gated_act": true, "layer_norm_epsilon": 1e-06, "model_type": "t5", "num_decoder_layers": 24, "num_heads": 64, "num_layers": 24, "output_past": true, "pad_token_id": 0, "relative_attention_max_distance": 128, "relative_attention_num_buckets": 32, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "use_cache": true, "vocab_size": 32128 }

2024-09-03 07:51:25,876 transformers.quantizers.quantizer_bnb_8bit 749 INFO Overriding torch_dtype=None with torch_dtype=torch.float16 due to requirements of bitsandbytes to enable model loading in 8-bit or 4-bit. Pass your own torch_dtype to specify the dtype of the remaining non-linear layers or pass torch_dtype=torch.float16 to remove this warning. 2024-09-03 07:51:25,876 transformers.quantizers.quantizer_bnb_8bit 749 INFO The device_map was not initialized. Setting device_map to {'':torch.cuda.current_device()}. If you want to use the model for inference, please set device_map ='auto' 2024-09-03 07:51:25,876 transformers.modeling_utils 749 WARNING low_cpu_mem_usage was None, now set to True since model is quantized. 2024-09-03 07:51:25,876 transformers.modeling_utils 749 INFO loading weights file /xinference/cache/FLUX.1-schnell/text_encoder_2/model.safetensors.index.json 2024-09-03 07:51:25,877 transformers.modeling_utils 749 INFO Instantiating T5EncoderModel model under default dtype torch.float16. 2024-09-03 07:51:31,572 transformers.modeling_utils 749 INFO All model checkpoint weights were used when initializing T5EncoderModel.

2024-09-03 07:51:31,572 transformers.modeling_utils 749 INFO All the weights of T5EncoderModel were initialized from the model checkpoint at /xinference/cache/FLUX.1-schnell. If your task is similar to the task the model of the checkpoint was trained on, you can already use T5EncoderModel for predictions without further training. 2024-09-03 07:51:31,656 xinference.model.image.stable_diffusion.core 749 INFO Loading model <class 'diffusers.pipelines.auto_pipeline.AutoPipelineForText2Image'> 2024-09-03 07:51:31,891 transformers.configuration_utils 749 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder/config.json 2024-09-03 07:51:31,892 transformers.configuration_utils 749 INFO Model config CLIPTextConfig { "_name_or_path": "openai/clip-vit-large-patch14", "architectures": [ "CLIPTextModel" ], "attention_dropout": 0.0, "bos_token_id": 0, "dropout": 0.0, "eos_token_id": 2, "hidden_act": "quick_gelu", "hidden_size": 768, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-05, "max_position_embeddings": 77, "model_type": "clip_text_model", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 1, "projection_dim": 768, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "vocab_size": 49408 }

2024-09-03 07:51:35,749 transformers.configuration_utils 749 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder/config.json 2024-09-03 07:51:35,750 transformers.configuration_utils 749 INFO Model config CLIPTextConfig { "_name_or_path": "openai/clip-vit-large-patch14", "architectures": [ "CLIPTextModel" ], "attention_dropout": 0.0, "bos_token_id": 0, "dropout": 0.0, "eos_token_id": 2, "hidden_act": "quick_gelu", "hidden_size": 768, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-05, "max_position_embeddings": 77, "model_type": "clip_text_model", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 1, "projection_dim": 768, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "vocab_size": 49408 }

2024-09-03 07:51:35,750 transformers.modeling_utils 749 INFO loading weights file /xinference/cache/FLUX.1-schnell/text_encoder/model.safetensors 2024-09-03 07:51:35,757 transformers.modeling_utils 749 INFO Instantiating CLIPTextModel model under default dtype torch.float16. 2024-09-03 07:51:36,126 transformers.modeling_utils 749 INFO All model checkpoint weights were used when initializing CLIPTextModel.

2024-09-03 07:51:36,127 transformers.modeling_utils 749 INFO All the weights of CLIPTextModel were initialized from the model checkpoint at /xinference/cache/FLUX.1-schnell/text_encoder. If your task is similar to the task the model of the checkpoint was trained on, you can already use CLIPTextModel for predictions without further training. 2024-09-03 07:51:36,162 transformers.tokenization_utils_base 749 INFO loading file vocab.json 2024-09-03 07:51:36,162 transformers.tokenization_utils_base 749 INFO loading file merges.txt 2024-09-03 07:51:36,162 transformers.tokenization_utils_base 749 INFO loading file added_tokens.json 2024-09-03 07:51:36,162 transformers.tokenization_utils_base 749 INFO loading file special_tokens_map.json 2024-09-03 07:51:36,162 transformers.tokenization_utils_base 749 INFO loading file tokenizer_config.json 2024-09-03 07:51:36,162 transformers.tokenization_utils_base 749 INFO loading file tokenizer.json 2024-09-03 07:51:36,163 transformers.models.clip.tokenization_clip 749 INFO ftfy or spacy is not installed using custom BasicTokenizer instead of ftfy. 2024-09-03 07:51:36,669 transformers.tokenization_utils_base 749 INFO loading file spiece.model 2024-09-03 07:51:36,669 transformers.tokenization_utils_base 749 INFO loading file tokenizer.json 2024-09-03 07:51:36,669 transformers.tokenization_utils_base 749 INFO loading file added_tokens.json 2024-09-03 07:51:36,670 transformers.tokenization_utils_base 749 INFO loading file special_tokens_map.json 2024-09-03 07:51:36,670 transformers.tokenization_utils_base 749 INFO loading file tokenizer_config.json 2024-09-03 07:51:36,670 transformers.models.t5.tokenization_t5_fast 749 WARNING You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers 2024-09-03 07:51:36,897 uvicorn.access 1 INFO 127.0.0.1:54878 - "POST /v1/models HTTP/1.1" 200 2024-09-03 07:52:26,933 xinference.api.restful_api 1 ERROR [address=0.0.0.0:41755, pid=749] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! 
(when checking argument for argument mat1 in method wrapper_CUDA_addmm)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1416, in create_images
    image_list = await model.text_to_image(
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 90, in wrapped_func
    ret = await fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 698, in text_to_image
    return await self._call_wrapper_json(
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 401, in _call_wrapper_json
    return await self._call_wrapper("json", fn, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 114, in _async_wrapper
    return await fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 418, in _call_wrapper
    ret = await asyncio.to_thread(fn, *args, **kwargs)
  File "/usr/lib/python3.10/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/image/stable_diffusion/core.py", line 220, in text_to_image
    return self._call_model(
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/image/stable_diffusion/core.py", line 173, in _call_model
    images = model(**kwargs).images
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/flux/pipeline_flux.py", line 696, in __call__
    noise_pred = self.transformer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_flux.py", line 366, in forward
    hidden_states = self.x_embedder(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 117, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: [address=0.0.0.0:41755, pid=749] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)

cnzayn commented 1 week ago

2024-09-03 08:19:40,123 uvicorn.access 1 INFO 127.0.0.1:44986 - "GET /v1/cluster/auth HTTP/1.1" 200 2024-09-03 08:19:40,131 xinference.core.supervisor 241 DEBUG Enter launch_builtin_model, model_uid: FLUX.1-schnell, model_name: FLUX.1-schnell, model_size: , model_format: None, quantization: None, replica: 1, kwargs: {'trust_remote_code': True, 'model-path': '/xinference/modelscope/hub/AI-ModelScope/FLUX_1-schnell', 'quantize_text_encoder': 'text_encoder_2'} 2024-09-03 08:19:40,132 xinference.core.worker 241 DEBUG Enter get_model_count, args: (<xinference.core.worker.WorkerActor object at 0x7f7c06f298a0>,), kwargs: {} 2024-09-03 08:19:40,133 xinference.core.worker 241 DEBUG Leave get_model_count, elapsed time: 0 s 2024-09-03 08:19:40,133 xinference.core.worker 241 DEBUG Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7f7c06f298a0>,), kwargs: {'model_uid': 'FLUX.1-schnell-1-0', 'model_name': 'FLUX.1-schnell', 'model_size_in_billions': None, 'model_format': None, 'quantization': None, 'model_engine': None, 'model_type': 'image', 'n_gpu': 'auto', 'request_limits': None, 'peft_model_config': None, 'gpu_idx': None, 'download_hub': None, 'model_path': None, 'trust_remotecode': True, 'model-path': '/xinference/modelscope/hub/AI-ModelScope/FLUX1-schnell', 'quantize_text_encoder': 'text_encoder_2'} 2024-09-03 08:19:40,134 xinference.core.worker 241 DEBUG GPU selected: [0] for model FLUX.1-schnell-1-0 2024-09-03 08:19:48,078 xinference.model.image.core 241 DEBUG Image model FLUX.1-schnell found in ModelScope. 2024-09-03 08:19:48,078 xinference.model.utils 241 INFO Use model cache from a different hub. 2024-09-03 08:19:49,747 transformers.configuration_utils 643 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder_2/config.json 2024-09-03 08:19:49,748 transformers.configuration_utils 643 INFO Model config T5Config { "_name_or_path": "google/t5-v1_1-xxl", "architectures": [ "T5EncoderModel" ], "classifier_dropout": 0.0, "d_ff": 10240, "d_kv": 64, "d_model": 4096, "decoder_start_token_id": 0, "dense_act_fn": "gelu_new", "dropout_rate": 0.1, "eos_token_id": 1, "feed_forward_proj": "gated-gelu", "initializer_factor": 1.0, "is_encoder_decoder": true, "is_gated_act": true, "layer_norm_epsilon": 1e-06, "model_type": "t5", "num_decoder_layers": 24, "num_heads": 64, "num_layers": 24, "output_past": true, "pad_token_id": 0, "relative_attention_max_distance": 128, "relative_attention_num_buckets": 32, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "use_cache": true, "vocab_size": 32128 }

2024-09-03 08:19:49,749 transformers.quantizers.quantizer_bnb_8bit 643 INFO Overriding torch_dtype=None with torch_dtype=torch.float16 due to requirements of bitsandbytes to enable model loading in 8-bit or 4-bit. Pass your own torch_dtype to specify the dtype of the remaining non-linear layers or pass torch_dtype=torch.float16 to remove this warning. 2024-09-03 08:19:49,749 transformers.quantizers.quantizer_bnb_8bit 643 INFO The device_map was not initialized. Setting device_map to {'':torch.cuda.current_device()}. If you want to use the model for inference, please set device_map ='auto' 2024-09-03 08:19:49,749 transformers.modeling_utils 643 WARNING low_cpu_mem_usage was None, now set to True since model is quantized. 2024-09-03 08:19:49,750 transformers.modeling_utils 643 INFO loading weights file /xinference/cache/FLUX.1-schnell/text_encoder_2/model.safetensors.index.json 2024-09-03 08:19:49,750 transformers.modeling_utils 643 INFO Instantiating T5EncoderModel model under default dtype torch.float16. 2024-09-03 08:19:55,553 transformers.modeling_utils 643 INFO All model checkpoint weights were used when initializing T5EncoderModel.

2024-09-03 08:19:55,554 transformers.modeling_utils 643 INFO All the weights of T5EncoderModel were initialized from the model checkpoint at /xinference/cache/FLUX.1-schnell. If your task is similar to the task the model of the checkpoint was trained on, you can already use T5EncoderModel for predictions without further training. 2024-09-03 08:19:55,635 xinference.model.image.stable_diffusion.core 643 INFO Loading model <class 'diffusers.pipelines.auto_pipeline.AutoPipelineForText2Image'> 2024-09-03 08:19:55,906 transformers.configuration_utils 643 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder/config.json 2024-09-03 08:19:55,907 transformers.configuration_utils 643 INFO Model config CLIPTextConfig { "_name_or_path": "openai/clip-vit-large-patch14", "architectures": [ "CLIPTextModel" ], "attention_dropout": 0.0, "bos_token_id": 0, "dropout": 0.0, "eos_token_id": 2, "hidden_act": "quick_gelu", "hidden_size": 768, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-05, "max_position_embeddings": 77, "model_type": "clip_text_model", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 1, "projection_dim": 768, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "vocab_size": 49408 }

2024-09-03 08:19:56,017 transformers.tokenization_utils_base 643 INFO loading file vocab.json 2024-09-03 08:19:56,017 transformers.tokenization_utils_base 643 INFO loading file merges.txt 2024-09-03 08:19:56,017 transformers.tokenization_utils_base 643 INFO loading file added_tokens.json 2024-09-03 08:19:56,017 transformers.tokenization_utils_base 643 INFO loading file special_tokens_map.json 2024-09-03 08:19:56,017 transformers.tokenization_utils_base 643 INFO loading file tokenizer_config.json 2024-09-03 08:19:56,017 transformers.tokenization_utils_base 643 INFO loading file tokenizer.json 2024-09-03 08:19:56,018 transformers.models.clip.tokenization_clip 643 INFO ftfy or spacy is not installed using custom BasicTokenizer instead of ftfy. 2024-09-03 08:19:56,577 transformers.tokenization_utils_base 643 INFO loading file spiece.model 2024-09-03 08:19:56,577 transformers.tokenization_utils_base 643 INFO loading file tokenizer.json 2024-09-03 08:19:56,577 transformers.tokenization_utils_base 643 INFO loading file added_tokens.json 2024-09-03 08:19:56,577 transformers.tokenization_utils_base 643 INFO loading file special_tokens_map.json 2024-09-03 08:19:56,577 transformers.tokenization_utils_base 643 INFO loading file tokenizer_config.json 2024-09-03 08:19:56,578 transformers.models.t5.tokenization_t5_fast 643 WARNING You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers 2024-09-03 08:20:00,595 transformers.configuration_utils 643 INFO loading configuration file /xinference/cache/FLUX.1-schnell/text_encoder/config.json 2024-09-03 08:20:00,596 transformers.configuration_utils 643 INFO Model config CLIPTextConfig { "_name_or_path": "openai/clip-vit-large-patch14", "architectures": [ "CLIPTextModel" ], "attention_dropout": 0.0, "bos_token_id": 0, "dropout": 0.0, "eos_token_id": 2, "hidden_act": "quick_gelu", "hidden_size": 768, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-05, "max_position_embeddings": 77, "model_type": "clip_text_model", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 1, "projection_dim": 768, "torch_dtype": "bfloat16", "transformers_version": "4.43.4", "vocab_size": 49408 }

2024-09-03 08:20:00,597 transformers.modeling_utils 643 INFO loading weights file /xinference/cache/FLUX.1-schnell/text_encoder/model.safetensors 2024-09-03 08:20:00,603 transformers.modeling_utils 643 INFO Instantiating CLIPTextModel model under default dtype torch.float16. 2024-09-03 08:20:00,976 transformers.modeling_utils 643 INFO All model checkpoint weights were used when initializing CLIPTextModel.

2024-09-03 08:20:00,976 transformers.modeling_utils 643 INFO All the weights of CLIPTextModel were initialized from the model checkpoint at /xinference/cache/FLUX.1-schnell/text_encoder. If your task is similar to the task the model of the checkpoint was trained on, you can already use CLIPTextModel for predictions without further training. 2024-09-03 08:20:01,014 xinference.core.worker 241 DEBUG Leave launch_builtin_model, elapsed time: 20 s 2024-09-03 08:20:01,015 uvicorn.access 1 INFO 127.0.0.1:44998 - "POST /v1/models HTTP/1.1" 200 2024-09-03 08:20:07,093 uvicorn.access 1 INFO 10.90.152.64:50923 - "GET /ui/ HTTP/1.1" 304 2024-09-03 08:20:07,196 xinference.core.supervisor 241 DEBUG Enter list_models, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f7a37a2ee80>,), kwargs: {} 2024-09-03 08:20:07,196 xinference.core.worker 241 DEBUG Enter list_models, args: (<xinference.core.worker.WorkerActor object at 0x7f7c06f298a0>,), kwargs: {} 2024-09-03 08:20:07,196 xinference.core.worker 241 DEBUG Leave list_models, elapsed time: 0 s 2024-09-03 08:20:07,197 xinference.core.supervisor 241 DEBUG Leave list_models, elapsed time: 0 s 2024-09-03 08:20:07,198 uvicorn.access 1 INFO 10.90.152.64:50923 - "GET /v1/models HTTP/1.1" 200 2024-09-03 08:20:07,199 uvicorn.access 1 INFO 10.90.152.64:50925 - "GET /v1/cluster/auth HTTP/1.1" 200 2024-09-03 08:20:12,269 xinference.core.supervisor 241 DEBUG Enter get_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f7a37a2ee80>, 'FLUX.1-schnell'), kwargs: {} 2024-09-03 08:20:12,269 xinference.core.worker 241 DEBUG Enter get_model, args: (<xinference.core.worker.WorkerActor object at 0x7f7c06f298a0>,), kwargs: {'model_uid': 'FLUX.1-schnell-1-0'} 2024-09-03 08:20:12,270 xinference.core.worker 241 DEBUG Leave get_model, elapsed time: 0 s 2024-09-03 08:20:12,270 xinference.core.supervisor 241 DEBUG Leave get_model, elapsed time: 0 s 2024-09-03 08:20:12,273 xinference.core.model 643 DEBUG Enter wrapped_func, args: (<xinference.core.model.ModelActor object at 0x7f4c6d981e40>,), kwargs: {'prompt': 'an apple', 'n': 1, 'size': '10241024', 'response_format': 'url'} 2024-09-03 08:20:12,274 xinference.core.model 643 DEBUG Request text_to_image, current serve request count: 0, request limit: None for the model FLUX.1-schnell-1-0 2024-09-03 08:20:12,275 xinference.model.image.stable_diffusion.core 643 DEBUG stable diffusion args: {'prompt': 'an apple', 'height': 1024, 'width': 1024, 'num_images_per_prompt': 1} 2024-09-03 08:20:13,128 xinference.core.model 643 DEBUG After request text_to_image, current serve request count: 0 for the model FLUX.1-schnell-1-0 2024-09-03 08:20:13,138 xinference.api.restful_api 1 ERROR [address=0.0.0.0:46533, pid=643] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! 
(when checking argument for argument mat1 in method wrapper_CUDA_addmm)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1416, in create_images
    image_list = await model.text_to_image(
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 90, in wrapped_func
    ret = await fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 698, in text_to_image
    return await self._call_wrapper_json(
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 401, in _call_wrapper_json
    return await self._call_wrapper("json", fn, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 114, in _async_wrapper
    return await fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 418, in _call_wrapper
    ret = await asyncio.to_thread(fn, *args, **kwargs)
  File "/usr/lib/python3.10/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/image/stable_diffusion/core.py", line 220, in text_to_image
    return self._call_model(
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/image/stable_diffusion/core.py", line 173, in _call_model
    images = model(**kwargs).images
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/flux/pipeline_flux.py", line 696, in __call__
    noise_pred = self.transformer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_flux.py", line 366, in forward
    hidden_states = self.x_embedder(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 117, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: [address=0.0.0.0:46533, pid=643] Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
2024-09-03 08:20:13,148 uvicorn.access 1 INFO 10.90.152.64:50927 - "POST /v1/images/generations HTTP/1.1" 400

Updated the log; this time all the debug logs are printed.

qinxuye commented 1 week ago

Strange, we can't reproduce this on our 3090 test machine. Let me think about how to reproduce it.

cnzayn commented 1 week ago

Are you using the same version of the Docker image?

qinxuye commented 1 week ago

I'm using the main branch.

cnzayn commented 1 week ago

Maybe some code was changed on the main branch; could you try it with the Docker image version I mentioned?

qinxuye commented 1 week ago

Then you can try 0.14.4.post1, which should be the latest.

cnzayn commented 1 week ago

With 0.14.4.post1 I get exactly the same error. I really can't figure out where the problem is; the GPU memory is sufficient, and other large models run without any issues.
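To back up the point about memory being sufficient, a quick standalone check of free memory on every GPU visible inside the container (plain PyTorch, no Xinference involved) could look like this:

```python
# Print free/total memory for every CUDA device visible in the container.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 1024**3:.1f} GiB free / {total / 1024**3:.1f} GiB total")
```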

qinxuye commented 1 week ago

Let me see if I can find a GPU with roughly the same amount of memory to test with; on our cards it runs without any problem.

cnzayn commented 1 week ago

@qinxuye Hi, does the project have a WeChat group or something similar? It would be convenient to discuss issues there.

qinxuye commented 1 week ago

@qinxuye Hi, does the project have a WeChat group or something similar? It would be convenient to discuss issues there.

Scan the QR code to add the enterprise WeChat assistant: https://xorbits.cn/assets/images/wechat_work_qr.png

cnzayn commented 1 week ago

I ran this launch command: xinference launch --model-path /model_zoo/flux1-schnell-Q4_0/flux1-schnell-Q4_0.gguf -f ggufv2 --model-name FLUX.1-schnell --model-type image --quantize_text_encoder text_encoder_2. In theory it should load the model I already downloaded, but in practice it re-downloaded it from ModelScope. Is there something wrong with how the launch command is written?

qinxuye commented 1 week ago

I ran this launch command: xinference launch --model-path /model_zoo/flux1-schnell-Q4_0/flux1-schnell-Q4_0.gguf -f ggufv2 --model-name FLUX.1-schnell --model-type image --quantize_text_encoder text_encoder_2. In theory it should load the model I already downloaded, but in practice it re-downloaded it from ModelScope. Is there something wrong with how the launch command is written?

Use --model_path (with an underscore).

qinxuye commented 1 week ago

This hasn't been added to the default options yet; we'll add it a bit later.
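For reference, with the underscore spelling the command above would read: xinference launch --model_path /model_zoo/flux1-schnell-Q4_0/flux1-schnell-Q4_0.gguf -f ggufv2 --model-name FLUX.1-schnell --model-type image --quantize_text_encoder text_encoder_2. Only --model-path is changed to --model_path; whether this already works for GGUF image models in 0.14.x, or only once the option mentioned above lands, isn't confirmed in this thread.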

linssbf commented 1 week ago

"when checking argument for argument mat1 in method wrapper_CUDA_addmm" — does it run normally on a single card, and does this problem only appear when running on multiple cards concurrently?

cnzayn commented 1 week ago

@linssbf I haven't tried a single card; the server has 8x 2080 Ti by default. There is a screenshot of the GPU info above.

linssbf commented 1 week ago

@cnzayn You're using a Docker environment rather than one you set up yourself, so there is a lot of uncertainty. Are you sure the NVLink driver is working properly? The first thing is to make sure that driver is fine in your environment. If Docker runs fine on a single card, then the problem is narrowed down; in that case, consider configuring an environment yourself.
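One low-effort way to test the single-card hypothesis without rebuilding the environment (a suggestion, not something verified in this thread) is to expose only one GPU to the container, for example by starting it with --gpus device=0, or by adding -e CUDA_VISIBLE_DEVICES=0 to the docker run command. If generation then succeeds on a single 2080 Ti, the problem points at multi-GPU placement rather than the drivers.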