Training error

thanhhau097 commented 5 years ago

When I trained on your sample data with tensorflow-gpu 1.12, I got the error below. (I also cloned the tf-1.12 branch, but it gave the same error.)

INFO:tensorflow:Using config: {'_eval_distribute': None, '_num_worker_replicas': 1, '_session_config': allow_soft_placement: true , '_save_checkpoints_steps': None, '_service': None, '_task_id': 0, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_global_id_in_cluster': 0, '_protocol': None, '_master': '', '_tf_random_seed': None, '_save_checkpoints_secs': 120, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fd68bd37390>, '_experimental_distribute': None, '_keep_checkpoint_max': 5, '_is_chief': True, '_task_type': 'worker', '_device_fn': None, '_train_distribute': None, '_save_summary_steps': 100, '_model_dir': '../data/model', '_evaluation_master': '', '_num_ps_replicas': 0}
Traceback (most recent call last):
  File "train.py", line 182, in <module>
    tf.app.run()
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train.py", line 179, in main
    classifier.train( input_fn=_get_input, max_steps=FLAGS.max_num_steps )
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1234, in _train_model_default
    input_fn, model_fn_lib.ModeKeys.TRAIN))
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1075, in _get_features_and_labels_from_input_fn
    self._call_input_fn(input_fn, mode))
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1162, in _call_input_fn
    return input_fn(**kwargs)
  File "train.py", line 130, in _get_input
    dataset = pipeline.get_data( FLAGS.static_data, data_args)
  File "/home/lionel/Desktop/ML/mlcode/OCR/CRNN/cnn_lstm_ctc_ocr-master/src/pipeline.py", line 79, in get_data
    dataset = dpipe.get_dataset( dpipe_args )
  File "/home/lionel/Desktop/ML/mlcode/OCR/CRNN/cnn_lstm_ctc_ocr-master/src/mjsynth.py", line 60, in get_dataset
    buffer_size=buffer_sz )
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/data/ops/readers.py", line 218, in __init__
    prefetch_input_elements=None)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/data/ops/readers.py", line 134, in __init__
    cycle_length, block_length)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 2714, in __init__
    super(InterleaveDataset, self).__init__(input_dataset, map_func)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 2677, in __init__
    experimental_nested_dataset_support=True)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1860, in __init__
    self._function.add_to_graph(ops.get_default_graph())
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/framework/function.py", line 479, in add_to_graph
    self._create_definition_if_needed()
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/framework/function.py", line 335, in _create_definition_if_needed
    self._create_definition_if_needed_impl()
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/framework/function.py", line 344, in _create_definition_if_needed_impl
    self._capture_by_value, self._caller_device)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/framework/function.py", line 864, in func_graph_from_py_func
    outputs = func(*func_graph.inputs)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1794, in tf_data_structured_function_wrapper
    ret = func(*nested_args)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/data/ops/readers.py", line 210, in read_one_file
    return _TFRecordDataset(filename, compression_type, buffer_size)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/data/ops/readers.py", line 105, in __init__
    argument_default=_DEFAULT_READER_BUFFER_SIZE_BYTES)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/data/util/convert.py", line 32, in optional_param_to_tensor
    argument_value, dtype=argument_dtype, name=argument_name)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1050, in convert_to_tensor
    as_ref=False)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1146, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 229, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 208, in constant
    value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 442, in make_tensor_proto
    _AssertCompatible(values, dtype)
  File "/home/lionel/virtualenv/commonenv/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 353, in _AssertCompatible
    (dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected int64, got 256.0 of type 'float' instead.

weinman commented 5 years ago

Thanks for the report! Can you tell me whether you have the same error using Python 2.7? I've only tested the code for that version, and I'm guessing Python 3.x makes a different type inference somewhere.

If you do happen to spot it, a patch/PR will be most welcome. I will try to find and correct the error's source when I have time.

thanhhau097 commented 5 years ago

Thanks for your response. I will try that and give you feedback later.

GodV315 commented 5 years ago

Hello, have you solved this problem?

tschamp31 commented 4 years ago

I was receiving the same error traceback as above. The issue does seem to be Python 3.x: I installed Python 2.7.16 in Conda, then added TensorFlow via this repository https://github.com/fo40225/tensorflow-windows-wheel ... since the Windows TensorFlow environment otherwise requires 3.x. I hope this helps others. Once my first training run is done, I will see whether I can solve the porting issue for Python 3.x, since I will then have a reference result.

weinman commented 4 years ago

Thanks for the update. Please return with a potential fix if you find one.

tschamp31 commented 4 years ago

Okay, so far I have narrowed down some of the issues in updating to Python 3.x + TensorFlow 2.0 Beta. You can successfully run Python's 2to3 over the src directory and then tf_upgrade_v2. After tf_upgrade_v2, some files will still use tf.contrib and need to be fixed by hand. Those can often either be updated using TensorFlow Addons for TensorFlow 2.0 (Linux only, no Windows support), or "tf.contrib.x" can be replaced with "tf.compat.v1.x". Second, the specific 256.0 float error is caused by every occurrence of "num_buffered_elements". All the ones I had to update appear to be in pipeline.py; I simply wrapped "num_buffered_elements" in int().

Example line ~81 "dataset = dataset.prefetch( int(num_buffered_elements) )"
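
A likely reason the value arrives as a float at all: in Python 3 the `/` operator always performs true division, while Python 2 floors the result when both operands are ints. The sketch below uses made-up numbers (the names `batch_size` and `num_gpus` are illustrative assumptions, not taken from the repository) to show the difference and the two usual remedies.

```python
# Hypothetical values, chosen only so the result matches the 256.0 in the traceback.
batch_size = 512
num_gpus = 2

# Python 3: "/" is true division and returns a float even for an exact result.
per_gpu = batch_size / num_gpus
print(type(per_gpu).__name__, per_gpu)

# Remedy 1: floor division keeps the result an int.
per_gpu_floor = batch_size // num_gpus

# Remedy 2: cast explicitly, as done with int(num_buffered_elements) above.
per_gpu_cast = int(batch_size / num_gpus)
print(type(per_gpu_floor).__name__, type(per_gpu_cast).__name__)
```

Either remedy hands an int to the tf.data call, which is what the int64 conversion in the traceback expects.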

These changes have not solved all the conversion issues, but I wanted to give a new goalpost for anyone else pursuing the update. At the moment I believe a change in how TFRecords are created/stored/retrieved is causing my current error. I will report back once I am able to train successfully.

Good luck!

weinman commented 4 years ago

Thanks for the update! I'd be glad to receive a pull request and keep track of these changes in a new branch.

jeremmyzong commented 4 years ago

In src/train.py, changing

data_args = { 'num_threads': FLAGS.num_input_threads,
                  'batch_size': gpu_batch_size,
                  'filter_fn': filter_fn }

to

data_args = { 'num_threads': FLAGS.num_input_threads,
                  'batch_size': int(gpu_batch_size),
                  'filter_fn': filter_fn }

would fix this.
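
One caveat with a bare int() is that it silently truncates a value such as 170.67. If that matters for your setup, a stricter cast is possible. The helper below is only a sketch: `to_int_strict` is a hypothetical name, not part of the repository, and the dict values are stand-ins.

```python
def to_int_strict(value, name="value"):
    """Return value as an int, refusing anything that is not a whole number."""
    if isinstance(value, bool):
        # bool is a subclass of int; reject it to avoid accidental True -> 1.
        raise TypeError("%s must be a number, got %r" % (name, value))
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise ValueError("%s must be a whole number, got %r" % (name, value))


# Stand-in for the dict built in src/train.py; the values here are made up.
gpu_batch_size = 512 / 2  # float under Python 3
data_args = {'num_threads': 4,
             'batch_size': to_int_strict(gpu_batch_size, 'batch_size')}
print(data_args)
```

With this, a batch size that does not divide evenly across GPUs raises an error at startup instead of being rounded down.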

weinman commented 4 years ago

Thanks @jeremmyzong !

weinman / cnn_lstm_ctc_ocr

Training error #49