Describe the bug
Running any model larger than 20 GB causes the machine to reboot automatically.
To Reproduce
To help us reproduce this bug, please provide the information below:
```
Server:
 Containers: 4
  Running: 2
  Paused: 0
  Stopped: 2
 Images: 4
 Server Version: 25.0.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7c3aca7a610df76212171d200ca3811ff6096eb8
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
 Kernel Version: 3.10.0-1160.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 144
 Total Memory: 755.1GiB
```
Expected behavior
Small models such as qwen2 load and run with no problems at all.
1. With models larger than 20 GB, such as glm4, and two GPUs specified: the web chat on port 9997 returns gibberish a few times, then the machine reboots.
2. With a single GPU specified: it returns correct answers a few times, then the machine reboots.
3. ollama installed via Docker works fine, so I don't think the machine or Docker itself is the problem.
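For context on why models in this size class need one large GPU or two smaller ones, here is a rough back-of-envelope estimate of the memory the weights alone occupy. This is my own sketch, not part of Xinference: the function name is made up, and it assumes fp16 weights while ignoring KV cache and activation memory, which add several more GiB at inference time.

```python
def weight_memory_gib(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate GiB needed to hold model weights alone.

    Assumes a dense model; bytes_per_param=2 corresponds to fp16/bf16.
    """
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# A hypothetical 9B-parameter model (roughly the glm4-9b size class):
print(f"{weight_memory_gib(9):.1f} GiB")   # weights alone, before KV cache
```

With the runtime overhead on top, a model in this range plausibly crosses the 20 GB mark the report mentions, which is why splitting across two GPUs is attempted.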
Additional context
Add any other context about the problem here.