yuxiaoguo / VVNet

Implementation of View-volume network for semantic scene completion from a single depth image
MIT License

DataLossError: corrupted record at 0 when executing "source run_test.sh" #3

Closed jiunyen-ching closed 4 years ago

jiunyen-ching commented 4 years ago

Hi @yuxiaoguo, I successfully generated the .tfrecords for NYU and NYUCAD and now I want to try "run_test.sh". However, I ran into the error message below and am unable to proceed.

Any suggestions to tackle this problem?

Device info: OS: Ubuntu 16.04, Python: 3.5, TensorFlow: 1.3.0-rc2, CUDA: 8.0, cuDNN: 6.0

Error log:

(tfpy35) ching@ching-700-275d:~/Downloads/VVNet-master$ source run_test.sh
/home/ching/anaconda3/envs/tfpy35/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/home/ching/anaconda3/envs/tfpy35/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
[2019-09-24 17:35:48,703] INFO: ===============================
[2019-09-24 17:35:48,703] INFO: os=Linux
[2019-09-24 17:35:48,704] INFO: host=ching-700-275d
[2019-09-24 17:35:48,704] INFO: visible_device=0
[2019-09-24 17:35:48,704] INFO: known: Namespace(batch_per_device=2, eval_platform='suncg', eval_results='eval', input_gpu_nums=1, input_network='VVNetAE120', input_previous_model_path='./previous', input_training_data_path='/home/ching/Downloads/NYU-TF-60', input_validation_data_path='/home/ching/Downloads/NYU-TF-60', log_dir='./log', max_iters=150000, output_model_path='./ckp', phase='test', record_iters=2000)
[2019-09-24 17:35:48,704] INFO: unknown: []
Traceback (most recent call last):
  File "train.py", line 45, in <module>
    test.eval_network(args)
  File "/home/ching/Downloads/VVNet-master/scripts/test.py", line 39, in eval_network
    num_samples = sum(1 for _ in tf.python_io.tf_record_iterator(test_records[0]))
  File "/home/ching/Downloads/VVNet-master/scripts/test.py", line 39, in <genexpr>
    num_samples = sum(1 for _ in tf.python_io.tf_record_iterator(test_records[0]))
  File "/home/ching/anaconda3/envs/tfpy35/lib/python3.5/site-packages/tensorflow/python/lib/io/tf_record.py", line 77, in tf_record_iterator
    reader.GetNext(status)
  File "/home/ching/anaconda3/envs/tfpy35/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/ching/anaconda3/envs/tfpy35/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
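A quick way to reproduce the failure outside the full pipeline is to iterate the record file directly, as scripts/test.py does at line 39. This is only a minimal sketch; the .tfrecord path below is just an example.

import tensorflow as tf  # TF 1.x, matching the environment above

def count_records(tfrecord_path):
    # Count the records in the file, or report if it cannot be read at all.
    try:
        return sum(1 for _ in tf.python_io.tf_record_iterator(tfrecord_path))
    except tf.errors.DataLossError as err:
        print('unreadable TFRecord:', err)
        return -1

print(count_records('/home/ching/Downloads/NYU-TF-60/NYUtest.tfrecord'))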

yuxiaoguo commented 4 years ago

Hi, @jiunyen-ching

Thanks for your report. I will look into it and try to track down the problem you're seeing.

Best, Yuxiao Guo

jiunyen-ching commented 4 years ago

If it helps with debugging, here's the script I edited for my machine and ran:

export CUDA_VISIBLE_DEVICES=0
python train.py --input-previous-model-path ./previous --input-training-data-path ~/Downloads/NYUCAD/NYUCAD-TF-60 --input-validation-data-path ~/Downloads/NYUCAD/NYUCAD-TF-60 --input-gpu-nums 1 --input-network VVNetAE120 --log-dir ./log --max-iters 150000 --batch-per-device 2 --output-model-path ./ckp  --phase test

I have no previous models, so I just made an empty folder and kept --input-previous-model-path ./previous as it is.

yuxiaoguo commented 4 years ago

I found there is nothing in the generated TFRecord. After checking, I noticed a file path bug in the current code. Please try the updated version on master. I hope it helps.

jiunyen-ching commented 4 years ago

I have made the changes, but the error still persists. Even before the fix, my generated .tfrecords already had substantial sizes.

Here are the file sizes when generating with these lines:

depth_path = os.path.join(folder_path, sample + '.png')
bin_path = os.path.join(folder_path, sample + '.bin')

NYUCADtest.tfrecord - 776.3 MB
NYUCADtrain - 929.0 MB
NYUtest.tfrecord - 991.3 MB
NYUtrain.tfrecord - 1.2 GB

After changing them as follows,

depth_path = sample + '.png'
bin_path = sample + '.bin'

the file sizes are the same.

As a sanity check, I printed depth_path and bin_path while generating the .tfrecords, both before and after applying the changes to the code. A sample of the output for one file is below:

--0000601 write /home/ching/Downloads/NYUCAD/NYUCADtrain/NYU0289_0000 in TFRECORDS
/home/ching/Downloads/NYUCAD/NYUCADtrain/NYU0289_0000.png
/home/ching/Downloads/NYUCAD/NYUCADtrain/NYU0289_0000.bin
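As a small, hypothetical extension of this sanity check (not part of the repo's data preparation code; the paths are just the sample above), one could also verify that the referenced files exist and are non-empty before they are written into the TFRecord:

import os

def check_sample(depth_path, bin_path):
    # Report missing or empty source files before they go into the TFRecord.
    for path in (depth_path, bin_path):
        if not os.path.isfile(path):
            print('missing file:', path)
        elif os.path.getsize(path) == 0:
            print('empty file:', path)

check_sample('/home/ching/Downloads/NYUCAD/NYUCADtrain/NYU0289_0000.png',
             '/home/ching/Downloads/NYUCAD/NYUCADtrain/NYU0289_0000.bin')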

Assuming the .tfrecords you generated are of similar size and uncorrupted, would you mind sharing your .tfrecords so that I can try running them on my machine?

yuxiaoguo commented 4 years ago

I think it's fine to share the NYU & NYUCAD tfrecords with you. However, it has been a relatively long time since this project was finished, so I will re-configure the environment on my local machine and track down the problems in the current pipeline. Of course, I will share the training/test sets with you as well. I hope to finish this within the next two days.

jiunyen-ching commented 4 years ago

Thanks a ton @yuxiaoguo! Will look forward to your findings.

yuxiaoguo commented 4 years ago

The issue has been addressed. The data preparation code wasn't updated along with the training code: the training pipeline expects uncompressed TFRecords (since the voxels and images are already compressed in their own file formats), while the data preparation code still compressed the TFRecords.

I have configured the environment and run the experiments with the generated data, and everything goes well. One more thing: if you intend to experiment with the NYU & NYUCAD datasets, please set --record-iters to 200, which is also the setting reported in our paper.
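For reference, a minimal sketch of the mismatch described above (not the repo's actual data preparation code; the file name and the GZIP codec are only for illustration), using the TF 1.x tf.python_io API: a record written with compression and read back as if uncompressed fails with exactly DataLossError: corrupted record at 0.

import tensorflow as tf  # TF 1.x API, matching the environment in this issue

path = 'example.tfrecord'  # hypothetical file, not one of the dataset records
gzip_opts = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)

# Write one dummy record with compression enabled.
with tf.python_io.TFRecordWriter(path, options=gzip_opts) as writer:
    writer.write(b'dummy record')

# Reading it back while expecting uncompressed records fails like the log above.
try:
    next(tf.python_io.tf_record_iterator(path))
except tf.errors.DataLossError as err:
    print('mismatch reproduced:', err)  # corrupted record at 0

# Reading with the matching options succeeds.
print(next(tf.python_io.tf_record_iterator(path, options=gzip_opts)))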

jiunyen-ching commented 4 years ago

I see. I am now able to execute run_training.sh. I will update here if anything related comes up.

Thanks a lot for your help. Will close the issue now.