yahoo / TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Apache License 2.0

error while running mnist_tf_ds.py #607

Open jordanFisherYzw opened 1 year ago

jordanFisherYzw commented 1 year ago

requirement.txt:

tensorflow==2.10.0
tensorflow-datasets
tensorflow-io
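
(The versions that actually end up on the Spark executors can be confirmed with a quick probe along these lines; the module list simply mirrors the requirements above and the check itself is only a sketch.)

# Sketch: print the library versions the executor Python actually loads.
import tensorflow as tf
import tensorflow_io as tfio
import tensorflow_datasets as tfds

print("tensorflow:", tf.__version__)
print("tensorflow_io:", tfio.__version__)
print("tensorflow_datasets:", tfds.__version__)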

2023-09-23 06:57:51,595 WARNING (MainThread-18213) From mnist_tf_ds.py:36: _CollectiveAllReduceStrategyExperimental.__init__ (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version. Instructions for updating: use distribute.MultiWorkerMirroredStrategy instead
2023-09-23 06:57:51.598567: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/compat:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/apache/hadoop/lib/native/:/apache/java/jre/lib/amd64/server/
2023-09-23 06:57:51.598613: W tensorflow/stream_executor/cuda/cuda_driver.cc:263] failed call to cuInit: UNKNOWN ERROR (303)
2023-09-23 06:57:51.598645: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (hdc42-mcc10-01-0510-3605-052-tess0097): /proc/driver/nvidia/version does not exist
2023-09-23 06:57:51.600086: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-23 06:57:51.617826: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job chief -> {0 -> 10.97.207.26:37937}
2023-09-23 06:57:51.617870: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.179.131.146:36107, 1 -> 10.193.193.154:46431, 2 -> 10.216.144.168:36169}
2023-09-23 06:57:51.618003: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job chief -> {0 -> 10.97.207.26:37937}
2023-09-23 06:57:51.618026: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.179.131.146:36107, 1 -> 10.193.193.154:46431, 2 -> 10.216.144.168:36169}
2023-09-23 06:57:51.619000: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:438] Started server with target: grpc://10.179.131.146:36107
2023-09-23 06:57:52.622546: I tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:281] Coordination agent has successfully connected.
2023-09-23 06:57:52,623 INFO (MainThread-18213) Enabled multi-worker collective ops with available devices: ['/job:worker/replica:0/task:0/device:CPU:0', '/job:chief/replica:0/task:0/device:CPU:0', '/job:worker/replica:0/task:1/device:CPU:0', '/job:worker/replica:0/task:2/device:CPU:0']
2023-09-23 06:57:52,659 INFO (MainThread-18213) Check health not enabled.
2023-09-23 06:57:52,659 INFO (MainThread-18213) MultiWorkerMirroredStrategy with cluster_spec = {'chief': ['10.97.207.26:37937'], 'worker': ['10.179.131.146:36107', '10.193.193.154:46431', '10.216.144.168:36169']}, task_type = 'worker', task_id = 0, num_workers = 4, local_devices = ('/job:worker/task:0/device:CPU:0',), communication = CommunicationImplementation.AUTO
image_pattern is viewfs://apollo-rno/apps/hdmi-ebayadvertising/dmp/qa/zhiwyu/tfos/mnist/tfr/tfr/train/part*
2023-09-23 06:57:59,107 WARNING (MainThread-18213) AutoGraph could not transform <function main_fun.. at 0x7fcdef703af0> and will run it as-is. Cause: could not parse the source code of <function main_fun.. at 0x7fcdef703af0>: no matching AST found among candidates:

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
2023-09-23 06:57:59,134 WARNING (MainThread-18213) AutoGraph could not transform <function main_fun..parse_tfos at 0x7fcdef748310> and will run it as-is. Cause: Unable to locate the source code of <function main_fun..parse_tfos at 0x7fcdef748310>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
2023-09-23 06:57:59.977508: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Epoch 1/3
2023-09-23 06:58:00,045 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:00,285 INFO (MainThread-18213) Collective all_reduce tensors: 6 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:00,368 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:00,383 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:00,397 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:00,411 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:00,421 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:01,483 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:01,593 INFO (MainThread-18213) Collective all_reduce tensors: 6 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:01,638 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:01,647 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:01,659 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1
2023-09-23 06:58:01,674 INFO (MainThread-18213) Collective all_reduce tensors: 1 all_reduces, num_devices = 1, group_size = 4, implementation = CommunicationImplementation.AUTO, num_packs = 1

1/234 [..............................] - ETA: 25:42 - loss: 2.3018 - accuracy: 0.0703
3/234 [..............................] - ETA: 6s - loss: 2.2965 - accuracy: 0.0664
...
233/234 [============================>.] - ETA: 0s - loss: 2.1569 - accuracy: 0.4105
235/234 [==============================] - ETA: 0s - loss: 2.1563 - accuracy: 0.4113
Epoch 2: saving model to ./mnist_model/workertemp_0/weights-0002
234/234 [==============================] - 8s 34ms/step - loss: 2.1563 - accuracy: 0.4113
Epoch 3/3

1/234 [..............................] - ETA: 5s - loss: 2.0831 - accuracy: 0.5000
3/234 [..............................] - ETA: 6s - loss: 2.1071 - accuracy: 0.4349
...
164/234 [===================>..........] - ETA: 2s - loss: 2.0423 - accuracy: 0.5085
166/234 [====================>.........] - ETA: 2s - loss: 2.0417 - accuracy: 0.5085
168/234 [====================>.........]
- ETA: 2s - loss: 2.0408 - accuracy: 0.5089 170/234 [====================>.........] - ETA: 2s - loss: 2.0401 - accuracy: 0.5097 172/234 [=====================>........] - ETA: 2s - loss: 2.0393 - accuracy: 0.5098 175/234 [=====================>........] - ETA: 1s - loss: 2.0383 - accuracy: 0.5106 177/234 [=====================>........] - ETA: 1s - loss: 2.0377 - accuracy: 0.5107 179/234 [=====================>........] - ETA: 1s - loss: 2.0369 - accuracy: 0.5108 181/234 [======================>.......] - ETA: 1s - loss: 2.0364 - accuracy: 0.5114 183/234 [======================>.......] - ETA: 1s - loss: 2.0359 - accuracy: 0.5113 185/234 [======================>.......] - ETA: 1s - loss: 2.0353 - accuracy: 0.51122023-09-23 06:59:05.553174: E tensorflow/core/common_runtime/ring_alg.cc:291] Aborting RingReduce with UNAVAILABLE: Collective ops is aborted by: Collective ops is aborted by: Socket closed Additional GRPC error information from remote target /job:worker/replica:0/task:1: :{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14} The error could be from a previous operation. Restart your program to reset. Additional GRPC error information from remote target /job:worker/replica:0/task:2: :{"created":"@1695477545.546039371","description":"Error received from peer ipv4:10.216.144.168:36169","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14} The error could be from a previous operation. Restart your program to reset. Additional GRPC error information from remote target /job:chief/replica:0/task:0: :{"created":"@1695477545.552923440","description":"Error received from peer ipv4:10.97.207.26:37937","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. 
Restart your program to reset.\nAdditional GRPC error information from remote target /job:worker/replica:0/task:2:\n:{"created":"@1695477545.546039371","description":"Error received from peer ipv4:10.216.144.168:36169","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14} [type.googleapis.com/tensorflow.DerivedStatus=''] 2023-09-23 06:59:05.553257: E tensorflow/core/common_runtime/base_collective_executor.cc:249] BaseCollectiveExecutor::StartAbort UNAVAILABLE: Collective ops is aborted by: Collective ops is aborted by: Socket closed Additional GRPC error information from remote target /job:worker/replica:0/task:1: :{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14} The error could be from a previous operation. Restart your program to reset. Additional GRPC error information from remote target /job:worker/replica:0/task:2: :{"created":"@1695477545.546039371","description":"Error received from peer ipv4:10.216.144.168:36169","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14} The error could be from a previous operation. Restart your program to reset. Additional GRPC error information from remote target /job:chief/replica:0/task:0: :{"created":"@1695477545.552923440","description":"Error received from peer ipv4:10.97.207.26:37937","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. 
Restart your program to reset.\nAdditional GRPC error information from remote target /job:worker/replica:0/task:2:\n:{"created":"@1695477545.546039371","description":"Error received from peer ipv4:10.216.144.168:36169","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14} [type.googleapis.com/tensorflow.DerivedStatus=''] 23/09/23 06:59:06 ERROR Executor: Exception in task 1.1 in stage 1.0 (TID 11) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/worker.py", line 604, in main process() File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/worker.py", line 594, in process out_iter = func(split_index, iterator) File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func [Previous line repeated 1 more time] File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 418, in func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 932, in func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 467, in _mapfn File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 426, in wrapper_fn File "mnist_tf_ds.py", line 93, in main_fun File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler raise e.with_traceback(filtered_tb) from None File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", 
line 54, in quick_execute tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name, tensorflow.python.framework.errors_impl.UnavailableError: Graph execution error:

Detected at node 'SGD/CollectiveReduceV2' defined at (most recent call last): File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/daemon.py", line 211, in File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/daemon.py", line 186, in manager File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/daemon.py", line 74, in worker File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/worker.py", line 604, in main process() File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/pyspark.zip/pyspark/worker.py", line 594, in process out_iter = func(split_index, iterator) File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func [Previous line repeated 1 more time] File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 418, in func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/pyspark.zip/pyspark/rdd.py", line 932, in func File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 467, in _mapfn File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000001/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 426, in wrapper_fn File "mnist_tf_ds.py", line 93, in main_fun File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler return fn(*args, **kwargs) File 
"/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/keras/engine/training.py", line 1564, in fit tmp_logs = self.train_function(iterator) File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/keras/engine/training.py", line 1160, in train_function return step_function(self, iterator) File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/keras/engine/training.py", line 1146, in step_function outputs = model.distribute_strategy.run(run_step, args=(data,)) File "/hadoop/2/yarn/local/usercache/b_qa_ebayadvertising/appcache/application_1695143424058_380275/container_e3828_1695143424058_380275_01_000003/Python/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/utils.py", line 177, in _all_reduce_sum_fn return distribution.extended.batch_reduce_to( Node: 'SGD/CollectiveReduceV2' Collective ops is aborted by: Collective ops is aborted by: Collective ops is aborted by: Socket closed Additional GRPC error information from remote target /job:worker/replica:0/task:1: :{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14} The error could be from a previous operation. Restart your program to reset. Additional GRPC error information from remote target /job:worker/replica:0/task:2: :{"created":"@1695477545.546039371","description":"Error received from peer ipv4:10.216.144.168:36169","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14} The error could be from a previous operation. Restart your program to reset. Additional GRPC error information from remote target /job:chief/replica:0/task:0: :{"created":"@1695477545.552923440","description":"Error received from peer ipv4:10.97.207.26:37937","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. 
Restart your program to reset.\nAdditional GRPC error information from remote target /job:worker/replica:0/task:2:\n:{"created":"@1695477545.546039371","description":"Error received from peer ipv4:10.216.144.168:36169","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Collective ops is aborted by: Socket closed\nAdditional GRPC error information from remote target /job:worker/replica:0/task:1:\n:{"created":"@1695477545.542820388","description":"Error received from peer ipv4:10.193.193.154:46431","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14}\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":14} The error could be from a previous operation. Restart your program to reset. [[{{node SGD/CollectiveReduceV2}}]] [Op:__inference_train_function_1098]

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
at org.apache.spark.SparkContext.$anonfun$runJob$2(SparkContext.scala:2290)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1465)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

23/09/23 06:59:06 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 14
23/09/23 06:59:06 INFO Executor: Running task 1.2 in stage 1.0 (TID 14)
2023-09-23 06:59:06.315771: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with on
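
(Side note: the AutoGraph warnings near the top of the log look unrelated to the crash; they can be silenced exactly as the message itself suggests. A minimal sketch, assuming the nested parse_tfos map function from the script posted below:)

import tensorflow as tf

# Sketch: decorating the nested map function silences the AutoGraph warning,
# as recommended by the log message; the parsing logic itself is unchanged.
@tf.autograph.experimental.do_not_convert
def parse_tfos(example_proto):
    ...  # same body as parse_tfos in the script below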

jordanFisherYzw commented 1 year ago

# Adapted from: https://www.tensorflow.org/beta/tutorials/distribute/multi_worker_with_keras

from __future__ import absolute_import, division, print_function, unicode_literals


def get_local_env(cmd_str):
    import subprocess
    out = subprocess.Popen(cmd_str, stdout=subprocess.PIPE, shell=True)
    return out.stdout.read().decode('utf-8').strip()


def get_runtime_env():
    print("----------------------\n")
    print("libjvm : " + get_local_env("find /apache/ -name libjvm.so"))
    print("----------------------\n")
    print("libhdfs.so : " + get_local_env("find /apache/ -name libhdfs.so"))
    print("----------------------\n")
    print("java related : " + get_local_env("echo $JAVA_HOME; echo $JRE_HOME; echo $JDK_HOME"))
    print("---------------------\n")
    print("classpath : " + get_local_env("echo $CLASSPATH"))
    print("---------------------\n")
    print(get_local_env("ls -alh /apache | grep hadoop"))
    print("---------------------\n")
    print(get_local_env("/apache/hadoop/bin/hadoop classpath --glob"))
    print("---------------------\n")
    print(get_local_env("ls -alh /apache/ | grep hadoop"))


def main_fun(args, ctx):
    """Example demonstrating loading TFRecords directly from disk (e.g. HDFS) without tensorflow_datasets."""
    import tensorflow as tf
    import tensorflow_io as tfio
    from tensorflowonspark import compat

    get_runtime_env()
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    BUFFER_SIZE = args.buffer_size
    BATCH_SIZE = args.batch_size
    NUM_WORKERS = args.cluster_size

    def parse_tfos(example_proto):
        feature_def = {"label": tf.io.FixedLenFeature(1, tf.int64),
                       "image": tf.io.FixedLenFeature(28 * 28 * 1, tf.int64)}
        features = tf.io.parse_single_example(example_proto, feature_def)
        image = tf.cast(features['image'], tf.float32) / 255
        image = tf.reshape(image, (28, 28, 1))
        label = tf.cast(features['label'], tf.int32)
        return (image, label)

    image_pattern = ctx.absolute_path(args.images_labels)
    print("image_pattern is {0}".format(image_pattern))
    ds = tf.data.Dataset.list_files(image_pattern)
    ds = ds.repeat(args.epochs).shuffle(BUFFER_SIZE)
    ds = ds.interleave(lambda x: tf.data.TFRecordDataset(x, compression_type='GZIP'))
    train_datasets_unbatched = ds.map(parse_tfos)

    def build_and_compile_cnn_model():
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(10, activation='softmax')
        ])
        model.compile(
            loss=tf.keras.losses.sparse_categorical_crossentropy,
            optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
            metrics=['accuracy'])
        return model

    # single node
    # single_worker_model = build_and_compile_cnn_model()
    # single_worker_model.fit(x=train_datasets, epochs=3)

    # Here the batch size scales up by number of workers since
    # `tf.data.Dataset.batch` expects the global batch size. Previously we used 64,
    # and now this becomes 128.
    GLOBAL_BATCH_SIZE = BATCH_SIZE * NUM_WORKERS
    train_datasets = train_datasets_unbatched.batch(GLOBAL_BATCH_SIZE)

    # this fails
    # callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath=args.model_dir)]
    tf.io.gfile.makedirs(args.model_dir)
    filepath = args.model_dir + "/weights-{epoch:04d}"
    callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath=filepath, verbose=1, save_weights_only=True)]

    # Note: if your part files have an uneven number of records, you may see an "Out of Range" exception
    # at less than the expected number of steps_per_epoch, because the executor with the fewest records will finish first.
    steps_per_epoch = 60000 / GLOBAL_BATCH_SIZE

    with strategy.scope():
        multi_worker_model = build_and_compile_cnn_model()

    multi_worker_model.fit(x=train_datasets, epochs=args.epochs, steps_per_epoch=steps_per_epoch, callbacks=callbacks)

    compat.export_saved_model(multi_worker_model, args.export_dir, ctx.job_name == 'chief')


if __name__ == '__main__':
    import argparse
    import socket

    from pyspark.context import SparkContext
    from pyspark.conf import SparkConf

    notebook_ip = socket.gethostbyname(socket.gethostname())
    notebook_port = "30202"
    conf = SparkConf()
    conf.setAppName("mnist_keras_tfrecord")
    conf.set("spark.driver.host", notebook_ip)
    conf.set("spark.driver.port", notebook_port)
    sc = SparkContext(conf=conf)
    executors = sc._conf.get("spark.executor.instances")
    num_executors = int(executors) if executors is not None else 1

    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", help="number of records per batch", type=int, default=64)
    parser.add_argument("--buffer_size", help="size of shuffle buffer", type=int, default=10000)
    parser.add_argument("--cluster_size", help="number of nodes in the cluster", type=int, default=num_executors)
    parser.add_argument("--epochs", help="number of epochs", type=int, default=3)
    parser.add_argument("--images_labels", help="HDFS path to MNIST image_label files in parallelized format")
    parser.add_argument("--model_dir", help="path to save model/checkpoint", default="mnist_model")
    parser.add_argument("--export_dir", help="path to export saved_model", default="mnist_export")
    parser.add_argument("--tensorboard", help="launch tensorboard process", action="store_true")

    args = parser.parse_args()
    print("args:", args)

    from tensorflowonspark import TFCluster

    cluster = TFCluster.run(sc, main_fun, args, args.cluster_size, num_ps=0, tensorboard=args.tensorboard,
                            input_mode=TFCluster.InputMode.TENSORFLOW, master_node='chief')
    cluster.shutdown()
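
(One thing that may be worth trying, purely as a guess prompted by the "Out of Range" note next to steps_per_epoch above: if the per-file record counts are uneven, one worker can exhaust its shard before the others and drop its gRPC connection, which would then surface as the "Socket closed" collective abort seen in the log. A sketch of a possible adjustment, reusing train_datasets_unbatched and GLOBAL_BATCH_SIZE from the script above; this is an assumption on my part, not part of the original example:)

import tensorflow as tf

# Sketch: shard at the record level instead of the default file-level sharding,
# so every worker sees roughly the same number of batches even if the part
# files are uneven in size.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA

train_datasets = train_datasets_unbatched.with_options(options).batch(GLOBAL_BATCH_SIZE)

# Keras expects an integer step count; flooring avoids asking any worker for a
# final partial step that its shard cannot supply.
steps_per_epoch = 60000 // GLOBAL_BATCH_SIZE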