Closed jiunyen-ching closed 4 years ago
Hi, @jiunyen-ching
Thanks for your report. I will look into it and try to pin down the problem you are seeing.
Best, Yuxiao Guo
If it helps with debugging, here's the script I ran, edited to match my machine.
export CUDA_VISIBLE_DEVICES=0
python train.py --input-previous-model-path ./previous --input-training-data-path ~/Downloads/NYUCAD/NYUCAD-TF-60 --input-validation-data-path ~/Downloads/NYUCAD/NYUCAD-TF-60 --input-gpu-nums 1 --input-network VVNetAE120 --log-dir ./log --max-iters 150000 --batch-per-device 2 --output-model-path ./ckp --phase test
I have no previous models, so I just made an empty folder and kept --input-previous-model-path ./previous as it is.
I found that the generated TFRecord is empty. After checking, I noticed a file path bug in the current code. Please try the updated version on master. I hope it helps.
I have made the changes and the error still persists. Before the changes, my generated .tfrecords already had significant sizes.
Here are the files generated with these lines:
depth_path = os.path.join(folder_path, sample + '.png')
bin_path = os.path.join(folder_path, sample + '.bin')
NYUCADtest.tfrecord - 776.3 MB
NYUCADtrain - 929.0 MB
NYUtest.tfrecord - 991.3 MB
NYUtrain.tfrecord - 1.2 GB
After changing the lines to
depth_path = sample + '.png'
bin_path = sample + '.bin'
the file sizes are the same.
As a sanity check, I printed depth_path and bin_path while generating the .tfrecords, both before and after making the changes. The output was the same in both cases; a sample for one file is below:
--0000601 write /home/ching/Downloads/NYUCAD/NYUCADtrain/NYU0289_0000 in TFRECORDS
/home/ching/Downloads/NYUCAD/NYUCADtrain/NYU0289_0000.png
/home/ching/Downloads/NYUCAD/NYUCADtrain/NYU0289_0000.bin
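One possible reason the two versions behave identically, assuming sample already holds an absolute path as the printed output suggests: os.path.join discards every earlier component once it meets an absolute one. A minimal sketch using the paths from the sample output (posixpath is the module that os.path resolves to on Linux; the variable values are taken from the log, not from the repository code):

```python
import posixpath  # os.path is posixpath on Linux

# Values assumed from the printed sample output above.
folder_path = "/home/ching/Downloads/NYUCAD/NYUCADtrain"
sample = "/home/ching/Downloads/NYUCAD/NYUCADtrain/NYU0289_0000"

# Original line: join the folder with the sample name.
joined = posixpath.join(folder_path, sample + ".png")

# Updated line: use the sample name directly.
plain = sample + ".png"

# Because `sample` is absolute, join() drops `folder_path` entirely,
# so both versions resolve to the same file.
print(joined)
print(joined == plain)
```

If that assumption holds, the path change would not alter which files are read, which is consistent with the unchanged .tfrecord sizes.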
Assuming the .tfrecords you generated are the same size and uncorrupted, would you mind sharing them so that I can try running them on my machine?
I think it's fine to share the NYU & NYUCAD tfrecords with you. However, it has been a relatively long time since this project was finished. I will re-configure the environment on my local machine and track down the problems in the current pipeline. Of course, I will share the training/test sets with you as well. I hope to finish within the next two days.
Thanks a ton @yuxiaoguo! Will look forward to your findings.
The issue has been addressed. The data preparation code wasn't updated together with the training code. The training pipeline reads uncompressed TFRecords (since the voxels and images are already compressed in their own file formats), while the data preparation code was still writing compressed TFRecords.
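A mismatch like this can be checked without TensorFlow by looking at the first bytes of the file. The sketch below is a heuristic, not part of the VVNet code: it relies on the documented TFRecord layout (an 8-byte little-endian length followed by a masked CRC32C of that length) and on the usual gzip and zlib stream headers. The function names are made up for illustration.

```python
import struct

def _make_crc32c_table():
    # CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table

_CRC32C_TABLE = _make_crc32c_table()

def crc32c(data):
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

def masked_crc32c(data):
    # Masking used by the TFRecord format: rotate right by 15, add a constant.
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF

def first_record_header_ok(path):
    """True if the file starts with a valid uncompressed TFRecord header."""
    with open(path, "rb") as f:
        header = f.read(12)
    if len(header) < 12:
        return False
    (stored_crc,) = struct.unpack("<I", header[8:12])
    return masked_crc32c(header[:8]) == stored_crc

def guess_compression(path):
    """Heuristic guess: 'NONE', 'GZIP', 'ZLIB', or 'UNKNOWN'."""
    if first_record_header_ok(path):
        return "NONE"
    with open(path, "rb") as f:
        head = f.read(2)
    if head == b"\x1f\x8b":  # gzip magic number
        return "GZIP"
    # zlib header: first byte 0x78 and the 16-bit header divisible by 31.
    if len(head) == 2 and head[0] == 0x78 and (head[0] * 256 + head[1]) % 31 == 0:
        return "ZLIB"
    return "UNKNOWN"
```

Calling guess_compression on one of the generated files (for example NYUCADtest.tfrecord) and getting anything other than "NONE" would be consistent with a reader that expects uncompressed records failing on the very first record.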
I have configured and run the experiments with the generated data, and they run fine. One more thing: if you intend to work with the NYU & NYUCAD datasets, please set --record-iters to 200, which is also the setting reported in our paper.
I see. I am now able to execute run_training.sh. I will update here if anything related comes up.
Thanks a lot for your help. Will close issue now.
Hi @yuxiaoguo, I successfully generated the .tfrecords for NYU and NYUCAD and now I want to try "run_test.sh". However, I ran into the error message below and am unable to proceed.
Any suggestions to tackle this problem?
Device info:
OS: Ubuntu-16.04
Python: 3.5
Tensorflow: 1.3.0-rc2
CUDA: 8.0
CUDNN: 6.0
Error log:
(tfpy35) ching@ching-700-275d:~/Downloads/VVNet-master$ source run_test.sh
/home/ching/anaconda3/envs/tfpy35/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/home/ching/anaconda3/envs/tfpy35/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
[2019-09-24 17:35:48,703] INFO: ===============================
[2019-09-24 17:35:48,703] INFO: os=Linux
[2019-09-24 17:35:48,704] INFO: host=ching-700-275d
[2019-09-24 17:35:48,704] INFO: visible_device=0
[2019-09-24 17:35:48,704] INFO: known: Namespace(batch_per_device=2, eval_platform='suncg', eval_results='eval', input_gpu_nums=1, input_network='VVNetAE120', input_previous_model_path='./previous', input_training_data_path='/home/ching/Downloads/NYU-TF-60', input_validation_data_path='/home/ching/Downloads/NYU-TF-60', log_dir='./log', max_iters=150000, output_model_path='./ckp', phase='test', record_iters=2000)
[2019-09-24 17:35:48,704] INFO: unknown: []
Traceback (most recent call last):
  File "train.py", line 45, in <module>
    test.eval_network(args)
  File "/home/ching/Downloads/VVNet-master/scripts/test.py", line 39, in eval_network
    num_samples = sum(1 for _ in tf.python_io.tf_record_iterator(test_records[0]))
  File "/home/ching/Downloads/VVNet-master/scripts/test.py", line 39, in <genexpr>
    num_samples = sum(1 for _ in tf.python_io.tf_record_iterator(test_records[0]))
  File "/home/ching/anaconda3/envs/tfpy35/lib/python3.5/site-packages/tensorflow/python/lib/io/tf_record.py", line 77, in tf_record_iterator
    reader.GetNext(status)
  File "/home/ching/anaconda3/envs/tfpy35/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/ching/anaconda3/envs/tfpy35/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0