wilhelm-lab / koina

Democratizing ML in proteomics
https://koina.wilhelmlab.org/
Apache License 2.0

UniSpec error on tutorial page #116

Closed: yangkl96 closed this issue 2 months ago

yangkl96 commented 2 months ago

Hello,

I am getting the following error when running UniSpec:

{"error":"Failed to process the request(s) for model instance 'UniSpec', message: TritonModelException: PyTorch execute failure: The following operation failed in the TorchScript interpreter.\nTraceback of TorchScript, serialized code (most recent call last):\n File \"code/__torch__/models.py\", line 46, in forward\n out = torch.einsum(\"abc,bd->adc\", [inp, embed])\n input = torch.add_(out, pos)\n _12 = (_0).forward((embed_norm).forward(input, ), mask, )\n ~~~~~~~~~~~~~~~~~~~ <--- HERE\n _13 = (_2).forward((_1).forward(_12, mask, ), mask, )\n _14 = (_4).forward((_3).forward(_13, mask, ), mask, )\n File \"code/__torch__/torch/nn/modules/batchnorm.py\", line 17, in forward\n bias = self.bias\n weight = self.weight\n inp = torch.batch_norm(input, weight, bias, running_mean, running_var, False, 0.10000000000000001, 1.0000000000000001e-05, True)\n ~~~~~~~~~~~~~~~~ <--- HERE\n return inp\n\nTraceback of TorchScript, original code (most recent call last):\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/torch/nn/functional.py(2478): batch_norm\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/torch/nn/modules/batchnorm.py(171): forward\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/torch/nn/modules/module.py(1508): _slow_forward\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/torch/nn/modules/module.py(1527): _call_impl\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl\n/cmnfs/home/j.lapin/projects/UniSpec/models.py(488): forward\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/torch/nn/modules/module.py(1508): _slow_forward\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/torch/nn/modules/module.py(1527): _call_impl\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/torch/nn/modules/module.py(1518): 
_wrapped_call_impl\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/torch/jit/_trace.py(1065): trace_module\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/torch/jit/_trace.py(798): trace\n<ipython-input-8-a348496ade97>(1): <module>\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/IPython/core/interactiveshell.py(3526): run_code\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/IPython/core/interactiveshell.py(3466): run_ast_nodes\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/IPython/core/interactiveshell.py(3284): run_cell_async\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/IPython/core/async_helpers.py(129): _pseudo_sync_runner\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/IPython/core/interactiveshell.py(3079): _run_cell\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/IPython/core/interactiveshell.py(3024): run_cell\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/IPython/terminal/interactiveshell.py(881): interact\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/IPython/terminal/interactiveshell.py(888): mainloop\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/IPython/terminal/ipapp.py(318): start\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/traitlets/config/application.py(992): launch_instance\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/IPython/__init__.py(129): start_ipython\n/cmnfs/home/j.lapin/miniconda3/envs/torch/lib/python3.11/site-packages/IPython/__main__.py(15): <module>\n<frozen runpy>(88): _run_code\n<frozen runpy>(198): _run_module_as_main\nRuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR\n\n\nAt:\n /models/repo/UniSpec/1/model.py(527): predict_batch\n /models/repo/UniSpec/1/model.py(498): execute\n"}

Running the tutorial example in the documentation produces the same error.

LLautenbacher commented 2 months ago

Thank you for letting us know! I restarted the server, which at least temporarily solved the issue. I will look into the error; at first glance, it seems like an issue with the GPU drivers.
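
The error payload quoted in this thread is long, but the informative parts are the model name and the final exception line. As a minimal sketch (the helper name `summarize_triton_error` and the keys it returns are hypothetical, not part of Koina or Triton), a client could reduce such a response body to those parts like this:

```python
import json
import re


def summarize_triton_error(body: str) -> dict:
    """Extract the model name and the last exception line from an error payload.

    Assumes the body is a JSON object with an "error" string, as in the
    response quoted in this issue. Everything else here is illustrative.
    """
    message = json.loads(body).get("error", "")

    # The payload names the failing model as: model instance '<name>'
    model = re.search(r"model instance '([^']+)'", message)

    # Collect lines that start with an exception name such as "RuntimeError: ...".
    # The last one is treated as the underlying cause.
    causes = re.findall(
        r"^[ \t]*(\w+(?:Error|Exception): .+)$", message, flags=re.MULTILINE
    )
    root = causes[-1].strip() if causes else None

    return {
        "model": model.group(1) if model else None,
        "root_cause": root,
        # Heuristic only: flag messages that mention the GPU libraries.
        "mentions_gpu_stack": bool(root and re.search(r"cuDNN|CUDA", root)),
    }
```

Applied to a response like the one above, this should report the model `UniSpec` and a root cause of `RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR`, which is consistent with the first impression that the problem lies with the server's GPU setup rather than with the request.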