xychelsea / deepfacelab-docker

Docker Containers for DeepFaceLab with TensorFlow in Anaconda 3
GNU General Public License v3.0

Why doesn't Docker use all the VRAM on an EC2 instance? #18

Closed: Lenny4 closed this issue 1 year ago

Lenny4 commented 1 year ago

I launched a g4dn.4xlarge instance (with Ubuntu) on AWS, hoping to run DeepFaceLab on it.

Instance Size    GPU    vCPUs    Memory (GiB)    GPU memory (GiB)
g4dn.4xlarge     1      16       64              16

I successfully installed the nvidia-container-toolkit and launched a Docker container (which contains all the dependencies needed to run DeepFaceLab) with this command:

docker run --gpus all --rm -it \
     -v workspace:/usr/local/deepface/workspace \
     xychelsea/deepfacelab:latest-gpu /bin/bash
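
As a sanity check (a generic check, not something specific to this image), the same flags can be used to run nvidia-smi before training; when the NVIDIA container toolkit is set up correctly it normally injects the tool into the container, and the Tesla T4 should show up with roughly 15-16 GiB of total memory:

# Hedged check: nvidia-smi is normally made available inside the container
# by the NVIDIA container runtime when --gpus is passed.
docker run --gpus all --rm xychelsea/deepfacelab:latest-gpu nvidia-smi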

The container launches successfully, and I extract the images and the faceset. Then comes the training phase; here is the output:

$ bash scripts/6_train_Quick96.sh 
Running trainer.

[new] No saved models found. Enter a name of a new model : quick96
quick96

Model first run.

Choose one or several GPU idxs (separated by comma).

[CPU] : CPU
  [0] : Tesla T4

[0] Which GPU indexes to choose? : 
0

Initializing models: 100%|###################################################################################################################################################| 5/5 [00:01<00:00,  3.31it/s]
Loading samples: 100%|################################################################################################################################################| 1222/1222 [00:02<00:00, 436.58it/s]
Loading samples: 100%|################################################################################################################################################| 1217/1217 [00:02<00:00, 523.04it/s]
============ Model Summary =============
==                                    ==
==        Model name: quick96_Quick96 ==
==                                    ==
== Current iteration: 0               ==
==                                    ==
==---------- Model Options -----------==
==                                    ==
==        batch_size: 4               ==
==                                    ==
==------------ Running On ------------==
==                                    ==
==      Device index: 0               ==
==              Name: Tesla T4        ==
==              VRAM: 0.02GB          ==
==                                    ==
========================================
Starting. Press "Enter" to stop training and save model.

Trying to do the first iteration. If an error occurs, reduce the model parameters.

Error: 2 root error(s) found.
  (0) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
         [[node decoder_dst/res0/conv1/weight/read (defined at /deepfacelab/core/leras/layers/Conv2D.py:61) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
         [[node decoder_dst/res0/conv1/weight/read (defined at /deepfacelab/core/leras/layers/Conv2D.py:61) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[gradients/Reshape_18_grad/Reshape/_579]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Original stack trace for 'decoder_dst/res0/conv1/weight/read':
  File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/deepfacelab/mainscripts/Trainer.py", line 58, in trainerThread
    debug=debug)
  File "/deepfacelab/models/ModelBase.py", line 193, in __init__
    self.on_initialize()
  File "/deepfacelab/models/Model_Quick96/Model.py", line 73, in on_initialize
    self.src_dst_trainable_weights = self.encoder.get_weights() + self.inter.get_weights() + self.decoder_src.get_weights() + self.decoder_dst.get_weights()
  File "/deepfacelab/core/leras/models/ModelBase.py", line 77, in get_weights
    self.build()
  File "/deepfacelab/core/leras/models/ModelBase.py", line 65, in build
    self._build_sub(v[name],name)
  File "/deepfacelab/core/leras/models/ModelBase.py", line 35, in _build_sub
    layer.build()
  File "/deepfacelab/core/leras/models/ModelBase.py", line 65, in build
    self._build_sub(v[name],name)
  File "/deepfacelab/core/leras/models/ModelBase.py", line 33, in _build_sub
    layer.build_weights()
  File "/deepfacelab/core/leras/layers/Conv2D.py", line 61, in build_weights
    self.weight = tf.get_variable("weight", (self.kernel_size,self.kernel_size,self.in_ch,self.out_ch), dtype=self.dtype, initializer=kernel_initializer, trainable=self.trainable )
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1593, in get_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1336, in get_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 591, in get_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 543, in _true_getter
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 961, in _get_single_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 260, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 221, in _variable_v1_call
    shape=shape)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 199, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 2634, in default_variable_creator
    shape=shape)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 1668, in __init__
    shape=shape)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 1861, in _init_from_args
    self._snapshot = array_ops.identity(self._variable, name="read")
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 287, in identity
    ret = gen_array_ops.identity(input, name=name)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3943, in identity
    "Identity", input=input, name=name)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3536, in _create_op_internal
    op_def=op_def)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()

Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
         [[{{node decoder_dst/res0/conv1/weight/read}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
         [[{{node decoder_dst/res0/conv1/weight/read}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[gradients/Reshape_18_grad/Reshape/_579]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/deepfacelab/mainscripts/Trainer.py", line 129, in trainerThread
    iter, iter_time = model.train_one_iter()
  File "/usr/local/deepfacelab/models/ModelBase.py", line 474, in train_one_iter
    losses = self.onTrainOneIter()
  File "/usr/local/deepfacelab/models/Model_Quick96/Model.py", line 276, in onTrainOneIter
    warped_dst, target_dst, target_dstm)
  File "/usr/local/deepfacelab/models/Model_Quick96/Model.py", line 178, in src_dst_train
    self.target_dstm:target_dstm,
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
         [[node decoder_dst/res0/conv1/weight/read (defined at /deepfacelab/core/leras/layers/Conv2D.py:61) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
         [[node decoder_dst/res0/conv1/weight/read (defined at /deepfacelab/core/leras/layers/Conv2D.py:61) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[gradients/Reshape_18_grad/Reshape/_579]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Original stack trace for 'decoder_dst/res0/conv1/weight/read':
  File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/deepfacelab/mainscripts/Trainer.py", line 58, in trainerThread
    debug=debug)
  File "/deepfacelab/models/ModelBase.py", line 193, in __init__
    self.on_initialize()
  File "/deepfacelab/models/Model_Quick96/Model.py", line 73, in on_initialize
    self.src_dst_trainable_weights = self.encoder.get_weights() + self.inter.get_weights() + self.decoder_src.get_weights() + self.decoder_dst.get_weights()
  File "/deepfacelab/core/leras/models/ModelBase.py", line 77, in get_weights
    self.build()
  File "/deepfacelab/core/leras/models/ModelBase.py", line 65, in build
    self._build_sub(v[name],name)
  File "/deepfacelab/core/leras/models/ModelBase.py", line 35, in _build_sub
    layer.build()
  File "/deepfacelab/core/leras/models/ModelBase.py", line 65, in build
    self._build_sub(v[name],name)
  File "/deepfacelab/core/leras/models/ModelBase.py", line 33, in _build_sub
    layer.build_weights()
  File "/deepfacelab/core/leras/layers/Conv2D.py", line 61, in build_weights
    self.weight = tf.get_variable("weight", (self.kernel_size,self.kernel_size,self.in_ch,self.out_ch), dtype=self.dtype, initializer=kernel_initializer, trainable=self.trainable )
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1593, in get_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1336, in get_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 591, in get_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 543, in _true_getter
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 961, in _get_single_variable
    aggregation=aggregation)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 260, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 221, in _variable_v1_call
    shape=shape)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 199, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 2634, in default_variable_creator
    shape=shape)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 1668, in __init__
    shape=shape)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 1861, in _init_from_args
    self._snapshot = array_ops.identity(self._variable, name="read")
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 287, in identity
    ret = gen_array_ops.identity(input, name=name)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3943, in identity
    "Identity", input=input, name=name)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3536, in _create_op_internal
    op_def=op_def)
  File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()

I get an error telling me that resources are exhausted, while the instance has 64 GiB of memory and 16 GiB of VRAM. If you look at the Model Summary, it says VRAM: 0.02GB, yet I have far more VRAM than that on the g4dn.4xlarge instance.

What's the problem here?

Why can't I use all the VRAM available on the host from inside my Docker container?
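
For what it's worth, here is a rough way to see what TensorFlow itself detects inside the container (a sketch only: it assumes the image's default python can import TensorFlow, which may instead live in the deepfacelab conda env, and that the TF build is recent enough to expose tf.config.list_physical_devices):

# Rough diagnostic: list the GPUs TensorFlow can see from inside the container.
# On older TF builds the call is tf.config.experimental.list_physical_devices.
docker run --gpus all --rm xychelsea/deepfacelab:latest-gpu \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"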

zabique commented 1 year ago

DFL should run on the GPU, and in your case the GPU is reporting VRAM: 0.02GB; that is why you see OOM errors.
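
One thing that might be worth checking (just a guess on my side, not something I have confirmed on g4dn) is whether another process on the host is already holding the T4's memory, which would leave only a sliver free for the container:

# Hedged diagnostic on the EC2 host: show total/used/free GPU memory,
# then the full report, whose process table lists anything holding VRAM.
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
nvidia-smi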

Lenny4 commented 1 year ago

Hi @zabique, please look at https://aws.amazon.com/ec2/instance-types/g4/#Product_Details; I have far more VRAM than 0.02GB on a g4dn.4xlarge instance.

zabique commented 1 year ago

I'm just saying what TensorFlow is reporting; you can see that too.

Lenny4 commented 1 year ago

Yes, that was my conclusion too:

I get an error telling me that resources are exhausted, while the instance has 64 GiB of memory. If you look at the Model Summary, it says VRAM: 0.02GB, yet I have far more VRAM than that on the g4dn.4xlarge instance.

Do you know how I can fix that?

zabique commented 1 year ago

Sorry, I look stupid now; you clearly already know that this is the problem, not the RAM itself. I always install it as a conda env on a plain Linux instance; I've never used Docker.