The training part runs fine, but once it starts the eval it becomes extremely slow.
It no longer seems to use the GPU and appears to run only on the CPUs.
It also consumes a lot of memory, and eventually it runs out of memory (OOM) and crashes.
The only warning I got is the following:
2022-06-27 14:06:50.915791: E external/org_tensorflow/tensorflow/compiler/xla/service/slow_operation_alarm.cc:55]
Very slow compile? If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
Compiling module jit_train_step.503
I would really appreciate it if you could advise how I can resolve this issue.
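The warning above suggests rerunning with XLA_FLAGS=--xla_dump_to=/tmp/foo and attaching the results. A minimal sketch of how that variable could be set from Python follows. The helper name add_xla_flag is hypothetical, and the sketch assumes the variable has to be set before JAX is imported so that XLA picks it up when it initializes.

```python
import os
from typing import MutableMapping


def add_xla_flag(flag: str, env: MutableMapping[str, str] = os.environ) -> str:
    """Append `flag` to XLA_FLAGS unless it is already there.

    Hypothetical helper (not part of JAX or XLA). Existing flags are kept
    so that other settings in XLA_FLAGS are not overwritten.
    """
    flags = env.get("XLA_FLAGS", "").split()
    if flag not in flags:
        flags.append(flag)
    env["XLA_FLAGS"] = " ".join(flags)
    return env["XLA_FLAGS"]


# Flag and path are taken verbatim from the warning message.
# Assumption: this runs before `import jax` in the training script.
add_xla_flag("--xla_dump_to=/tmp/foo")
```

Exporting the variable in the shell before launching the training command should have the same effect. The dump directory can grow large when many modules are compiled.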
Hi,
I am able to run on an A100 GPU with the training command you suggested:
This is without xformers installed.