yahoo / TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Apache License 2.0

error while running mnist_tf_ds.py #607

Open jordanFisherYzw opened 1 year ago

jordanFisherYzw commented 1 year ago

requirement.txt:

tensorflow==2.10.0
tensorflow-datasets
tensorflow-io
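
(The versions that actually end up on the Spark executors can be confirmed with a quick probe along these lines; the module list simply mirrors the requirements above and the check itself is only a sketch.)

# Sketch: print the library versions the executor Python actually loads.
import tensorflow as tf
import tensorflow_io as tfio
import tensorflow_datasets as tfds

print("tensorflow:", tf.__version__)
print("tensorflow_io:", tfio.__version__)
print("tensorflow_datasets:", tfds.__version__)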

2023-09-23 06:57:51,595 WARNING (MainThread-18213) From mnist_tf_ds.py:36: _CollectiveAllReduceStrategyExperimental.__init__ (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version. Instructions for updating: use distribute.MultiWorkerMirroredStrategy instead
2023-09-23 06:57:51.598567: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/compat:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/apache/hadoop/lib/native/:/apache/java/jre/lib/amd64/server/
2023-09-23 06:57:51.598613: W tensorflow/stream_executor/cuda/cuda_driver.cc:263] failed call to cuInit: UNKNOWN ERROR (303)
2023-09-23 06:57:51.598645: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (hdc42-mcc10-01-0510-3605-052-tess0097): /proc/driver/nvidia/version does not exist
2023-09-23 06:57:51.600086: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-23 06:57:51.617826: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job chief -> {0 -> 10.97.207.26:37937}
2023-09-23 06:57:51.617870: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.179.131.146:36107, 1 -> 10.193.193.154:46431, 2 -> 10.216.144.168:36169}
2023-09-23 06:57:51.618003: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job chief -> {0 -> 10.97.207.26:37937}
2023-09-23 06:57:51.618026: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.179.131.146:36107, 1 -> 10.193.193.154:46431, 2 -> 10.216.144.168:36169}
2023-09-23 06:57:51.619000: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:438] Started server with target: grpc://10.179.131.146:36107
2023-09-23 06:57:52.622546: I tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:281] Coordination agent has successfully connected.
2023-09-23 06:57:52,623 INFO (MainThread-18213) Enabled multi-worker collective ops with available devices: ['/job:worker/replica:0/task:0/device:CPU:0', '/job:chief/replica:0/task:0/device:CPU:0', '/job:worker/replica:0/task:1/device:CPU:0', '/job:worker/replica:0/task:2/device:CPU:0']
2023-09-23 06:57:52,659 INFO (MainThread-18213) Check health not enabled.
2023-09-23 06:57:52,659 INFO (MainThread-18213) MultiWorkerMirroredStrategy with cluster_spec = {'chief': ['10.97.207.26:37937'], 'worker': ['10.179.131.146:36107', '10.193.193.154:46431', '10.216.144.168:36169']}, task_type = 'worker', task_id = 0, num_workers = 4, local_devices = ('/job:worker/task:0/device:CPU:0',), communication = CommunicationImplementation.AUTO
image_pattern is viewfs://apollo-rno/apps/hdmi-ebayadvertising/dmp/qa/zhiwyu/tfos/mnist/tfr/tfr/train/part*
2023-09-23 06:57:59,107 WARNING (MainThread-18213) AutoGraph could not transform <function main_fun.. at 0x7fcdef703af0> and will run it as-is. Cause: could not parse the source code of <function main_fun.. at 0x7fcdef703af0>: no matching AST found among candidates:

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
2023-09-23 06:57:59,134 WARNING (MainThread-18213) AutoGraph could not transform <function main_fun..parse_tfos at 0x7fcdef748310> and will run it as-is. Cause: Unable to locate the source code of <function main_fun..parse_tfos at 0x7fcdef748310>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
2023-09-23 06:57:59.977508: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Epoch 1/3
2023-09-23 06:58:00,045 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:00,285 INFO (MainThread-18213) Collective all_reduce tensors: 6 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:00,368 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:00,383 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:00,397 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:00,411 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:00,421 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:01,483 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:01,593 INFO (MainThread-18213) Collective all_reduce tensors: 6 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:01,638 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:01,647 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:01,659 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:01,674 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1

1/234 [..............................] - ETA: 25:42 - loss: 2.3018 - accuracy: 0.0703
3/234 [..............................] - ETA: 6s - loss: 2.2965 - accuracy: 0.0664
...
233/234 [============================>.] - ETA: 0s - loss: 2.1569 - accuracy: 0.4105
235/234 [==============================] - ETA: 0s - loss: 2.1563 - accuracy: 0.4113
Epoch 2: saving model to ./mnist_model/workertemp_0/weights-0002
234/234 [==============================] - 8s 34ms/step - loss: 2.1563 - accuracy: 0.4113
Epoch 3/3

1/234 [..............................] - ETA: 5s - loss: 2.0831 - accuracy: 0.5000
3/234 [..............................] - ETA: 6s - loss: 2.1071 - accuracy: 0.4349
...
164/234 [===================>..........] - ETA: 2s - loss: 2.0423 - accuracy: 0.5085
166/234 [====================>.........] - ETA: 2s - loss: 2.0417 - accuracy: 0.5085
168/234 [====================>.........]
- ETA: 2s - loss: 2.0408 - accuracy: 0.5089 170/234 [====================>.........] - ETA: 2s - loss: 2.0401 - accuracy: 0.5097 172/234 [=====================>........] - ETA: 2s - loss: 2.0393 - accuracy: 0.5098 175/234 [=====================>........] - ETA: 1s - loss: 2.0383 - accuracy: 0.5106 177/234 [=====================>........] - ETA: 1s - loss: 2.0377 - accuracy: 0.5107 179/234 [=====================>........] - ETA: 1s - loss: 2.0369 - accuracy: 0.5108 181/234 [======================>.......] - ETA: 1s - loss: 2.0364 - accuracy: 0.5114 183/234 [======================>.......] - ETA: 1s - loss: 2.0359 - accuracy: 0.5113 185/234 [======================>.......] - ETA: 1s - loss: 2.0353 - accuracy: 0.51122023-09-23 06:59:05.553174: E tensorflow/core/common_runtime/ring_alg.cc:291] Aborting RingReduce with UNAVAILABLE: Collective ops is aborted by: Collective ops is aborted by: Socket closed Additional GRPC error information from remote target /job:worker/replica:0/task:1: :{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14} The error could be from a previous operation. Restart your program to reset. Additional GRPC error information from remote target /job:worker/replica:0/task:2: :{"created":"@1695477545.546039371","description":"Error received from peer ipv4:10.216.144.168:36169","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14} The error could be from a previous operation. Restart your program to reset. Additional GRPC error information from remote target /job:chief/replica:0/task:0: :{"created":"@1695477545.552923440","description":"Error received from peer ipv4:10.97.207.26:37937","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. 
Restart your program to reset.\nAdditional GRPC error information from remote target /job:worker/replica:0/task:2:\n:{"created":"@1695477545.546039371","description":"Error received from peer ipv4:10.216.144.168:36169","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14} [type.googleapis.com/tensorflow.DerivedStatus=''] 2023-09-23 06:59:05.553257: E tensorflow/core/common_runtime/base_collective_executor.cc:249] BaseCollectiveExecutor::StartAbort UNAVAILABLE: Collective ops is aborted by: Collective ops is aborted by: Socket closed Additional GRPC error information from remote target /job:worker/replica:0/task:1: :{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14} The error could be from a previous operation. Restart your program to reset. Additional GRPC error information from remote target /job:worker/replica:0/task:2: :{"created":"@1695477545.546039371","description":"Error received from peer ipv4:10.216.144.168:36169","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14} The error could be from a previous operation. Restart your program to reset. Additional GRPC error information from remote target /job:chief/replica:0/task:0: :{"created":"@1695477545.552923440","description":"Error received from peer ipv4:10.97.207.26:37937","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. 
Restart your program to reset.\nAdditional GRPC error information from remote target /job:worker/replica:0/task:2:\n:{"created":"@1695477545.546039371","description":"Error received from peer ipv4:10.216.144.168:36169","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14} [type.googleapis.com/tensorflow.DerivedStatus=''] 23/09/23 06:59:06 ERROR Executor: Exception in task 1.1 in stage 1.0 (TID 11) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/worker.py", line 604, in main process() File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/worker.py", line 594, in process out_iter = func(split_index, iterator) File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func [Previous line repeated 1 more time] File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 418, in func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 932, in func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 467, in _mapfn File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 426, in wrapper_fn File "mnist_tf_ds.py", line 93, in main_fun File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler raise e.with_traceback(filtered_tb) from None File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", 
line 54, in quick_execute tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name, tensorflow.python.framework.errors_impl.UnavailableError: Graph execution error:

Detected at node 'SGD/CollectiveReduceV2' defined at (most recent call last): File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/daemon.py", line 211, in File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/daemon.py", line 186, in manager File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/daemon.py", line 74, in worker File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/worker.py", line 604, in main process() File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/worker.py", line 594, in process out_iter = func(split_index, iterator) File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func [Previous line repeated 1 more time] File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 418, in func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 932, in func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 467, in _mapfn File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 426, in wrapper_fn File "mnist_tf_ds.py", line 93, in main_fun File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler return fn(*args, **kwargs) File 
"/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/keras/engine/training.py", line 1564, in fit tmp_logs = self.train_function(iterator) File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/keras/engine/training.py", line 1160, in train_function return step_function(self, iterator) File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/keras/engine/training.py", line 1146, in step_function outputs = model.distribute_strategy.run(run_step, args=(data,)) File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/utils.py", line 177, in _all_reduce_sum_fn return distribution.extended.batch_reduce_to( Node: 'SGD/CollectiveReduceV2' Collective ops is aborted by: Collective ops is aborted by: Collective ops is aborted by: Socket closed Additional GRPC error information from remote target /job:worker/replica:0/task:1: :{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14} The error could be from a previous operation. Restart your program to reset. Additional GRPC error information from remote target /job:worker/replica:0/task:2: :{"created":"@1695477545.546039371","description":"Error received from peer ipv4:10.216.144.168:36169","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14} The error could be from a previous operation. Restart your program to reset. Additional GRPC error information from remote target /job:chief/replica:0/task:0: :{"created":"@1695477545.552923440","description":"Error received from peer ipv4:10.97.207.26:37937","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. 
Restart your program to reset.\nAdditional GRPC error information from remote target /job:worker/replica:0/task:2:\n:{"created":"@1695477545.546039371","description":"Error received from peer ipv4:10.216.144.168:36169","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14} The error could be from a previous operation. Restart your program to reset. [[{{node SGD/CollectiveReduceV2}}]] [Op:__inference_train_function_1098]

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
at org.apache.spark.SparkContext.$anonfun$runJob$2(SparkContext.scala:2290)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1465)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

23/09/23 06:59:06 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 14
23/09/23 06:59:06 INFO Executor: Running task 1.2 in stage 1.0 (TID 14)
2023-09-23 06:59:06.315771: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with on
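
(Side note: the AutoGraph warnings near the top of the log look unrelated to the crash; they can be silenced exactly as the message itself suggests. A minimal sketch, assuming the nested parse_tfos map function from the script posted below:)

import tensorflow as tf

# Sketch: decorating the nested map function silences the AutoGraph warning,
# as recommended by the log message; the parsing logic itself is unchanged.
@tf.autograph.experimental.do_not_convert
def parse_tfos(example_proto):
    ...  # same body as parse_tfos in the script below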

jordanFisherYzw commented 1 year ago

# Adapted from: https://www.tensorflow.org/beta/tutorials/distribute/multi_worker_with_keras

from __future__ import absolute_import, division, print_function, unicode_literals


def get_local_env(cmd_str):
    import subprocess
    out = subprocess.Popen(cmd_str, stdout=subprocess.PIPE, shell=True)
    return out.stdout.read().decode('utf-8').strip()


def get_runtime_env():
    print("----------------------\n")
    print("libjvm : " + get_local_env("find /apache/ -name libjvm.so"))
    print("----------------------\n")
    print("libhdfs.so : " + get_local_env("find /apache/ -name libhdfs.so"))
    print("----------------------\n")
    print("java related : " + get_local_env("echo $JAVA_HOME; echo $JRE_HOME; echo $JDK_HOME"))
    print("---------------------\n")
    print("classpath : " + get_local_env("echo $CLASSPATH"))
    print("---------------------\n")
    print(get_local_env("ls -alh /apache | grep hadoop"))
    print("---------------------\n")
    print(get_local_env("/apache/hadoop/bin/hadoop classpath --glob"))
    print("---------------------\n")
    print(get_local_env("ls -alh /apache/ | grep hadoop"))


def main_fun(args, ctx):
    """Example demonstrating loading TFRecords directly from disk (e.g. HDFS) without tensorflow_datasets."""
    import tensorflow as tf
    import tensorflow_io as tfio
    from tensorflowonspark import compat

    get_runtime_env()
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    BUFFER_SIZE = args.buffer_size
    BATCH_SIZE = args.batch_size
    NUM_WORKERS = args.cluster_size

    def parse_tfos(example_proto):
        feature_def = {"label": tf.io.FixedLenFeature(1, tf.int64),
                       "image": tf.io.FixedLenFeature(28 * 28 * 1, tf.int64)}
        features = tf.io.parse_single_example(example_proto, feature_def)
        image = tf.cast(features['image'], tf.float32) / 255
        image = tf.reshape(image, (28, 28, 1))
        label = tf.cast(features['label'], tf.int32)
        return (image, label)

    image_pattern = ctx.absolute_path(args.images_labels)
    print("image_pattern is {0}".format(image_pattern))
    ds = tf.data.Dataset.list_files(image_pattern)
    ds = ds.repeat(args.epochs).shuffle(BUFFER_SIZE)
    ds = ds.interleave(lambda x: tf.data.TFRecordDataset(x, compression_type='GZIP'))
    train_datasets_unbatched = ds.map(parse_tfos)

    def build_and_compile_cnn_model():
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(10, activation='softmax')
        ])
        model.compile(
            loss=tf.keras.losses.sparse_categorical_crossentropy,
            optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
            metrics=['accuracy'])
        return model

    # single node
    # single_worker_model = build_and_compile_cnn_model()
    # single_worker_model.fit(x=train_datasets, epochs=3)

    # Here the batch size scales up by number of workers since
    # `tf.data.Dataset.batch` expects the global batch size. Previously we used 64,
    # and now this becomes 128.
    GLOBAL_BATCH_SIZE = BATCH_SIZE * NUM_WORKERS
    train_datasets = train_datasets_unbatched.batch(GLOBAL_BATCH_SIZE)

    # this fails
    # callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath=args.model_dir)]
    tf.io.gfile.makedirs(args.model_dir)
    filepath = args.model_dir + "/weights-{epoch:04d}"
    callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath=filepath, verbose=1, save_weights_only=True)]

    # Note: if your part files have an uneven number of records, you may see an "Out of Range" exception
    # at less than the expected number of steps_per_epoch, because the executor with the fewest records will finish first.
    steps_per_epoch = 60000 / GLOBAL_BATCH_SIZE

    with strategy.scope():
        multi_worker_model = build_and_compile_cnn_model()

    multi_worker_model.fit(x=train_datasets, epochs=args.epochs, steps_per_epoch=steps_per_epoch, callbacks=callbacks)

    compat.export_saved_model(multi_worker_model, args.export_dir, ctx.job_name == 'chief')


if __name__ == '__main__':
    import argparse
    import socket

    from pyspark.context import SparkContext
    from pyspark.conf import SparkConf

    notebook_ip = socket.gethostbyname(socket.gethostname())
    notebook_port = "30202"
    conf = SparkConf()
    conf.setAppName("mnist_keras_tfrecord")
    conf.set("spark.driver.host", notebook_ip)
    conf.set("spark.driver.port", notebook_port)
    sc = SparkContext(conf=conf)
    executors = sc._conf.get("spark.executor.instances")
    num_executors = int(executors) if executors is not None else 1

    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", help="number of records per batch", type=int, default=64)
    parser.add_argument("--buffer_size", help="size of shuffle buffer", type=int, default=10000)
    parser.add_argument("--cluster_size", help="number of nodes in the cluster", type=int, default=num_executors)
    parser.add_argument("--epochs", help="number of epochs", type=int, default=3)
    parser.add_argument("--images_labels", help="HDFS path to MNIST image_label files in parallelized format")
    parser.add_argument("--model_dir", help="path to save model/checkpoint", default="mnist_model")
    parser.add_argument("--export_dir", help="path to export saved_model", default="mnist_export")
    parser.add_argument("--tensorboard", help="launch tensorboard process", action="store_true")

    args = parser.parse_args()
    print("args:", args)

    from tensorflowonspark import TFCluster

    cluster = TFCluster.run(sc, main_fun, args, args.cluster_size, num_ps=0, tensorboard=args.tensorboard,
                            input_mode=TFCluster.InputMode.TENSORFLOW, master_node='chief')
    cluster.shutdown()
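
(One thing that may be worth trying, purely as a guess prompted by the "Out of Range" note next to steps_per_epoch above: if the per-file record counts are uneven, one worker can exhaust its shard before the others and drop its gRPC connection, which would then surface as the "Socket closed" collective abort seen in the log. A sketch of a possible adjustment, reusing train_datasets_unbatched and GLOBAL_BATCH_SIZE from the script above; this is an assumption on my part, not part of the original example:)

import tensorflow as tf

# Sketch: shard at the record level instead of the default file-level sharding,
# so every worker sees roughly the same number of batches even if the part
# files are uneven in size.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA

train_datasets = train_datasets_unbatched.with_options(options).batch(GLOBAL_BATCH_SIZE)

# Keras expects an integer step count; flooring avoids asking any worker for a
# final partial step that its shard cannot supply.
steps_per_epoch = 60000 // GLOBAL_BATCH_SIZE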