Open zaccharieramzi opened 2 years ago
Hi, which configuration versions are you using? Are they based on the information given by the author? I just used Python 3.6 and the configuration given by the author, and a JAX error appears.
Hi, I am a student at SYSU. The GPU cannot be used with this JAX version, so it is better to use a TPU. After testing, I also found it is better to have >= 48 GB of video memory (A100), and jaxlib1.69-1.73 works better. This was my first time trying the JAX framework and the problem troubled me for a long time, so I hope this helps you.
Hi @tianzhijiaoziA, I have been having the same issues for weeks. Could you please share some more details on running it on a TPU or an A100? Currently I am running Python 3.6 on a cluster that only supports up to JAX 0.2.17, which gives these errors:
WARNING:tensorflow:From /home/fs01/dm852/venv/test/lib64/python3.6/site-packages/tensorflow_gan/python/estimator/tpu_gan_estimator.py:42: The name tf.estimator.tpu.TPUEstimator is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimator instead.
I1004 19:30:43.270267 22970125400960 tpu_client.py:54] Starting the local TPU driver.
I1004 19:30:43.271347 22970125400960 xla_bridge.py:231] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I1004 19:30:43.271621 22970125400960 xla_bridge.py:231] Unable to initialize backend 'gpu': Not found: Could not find registered platform with name: "cuda". Available platform names are: Interpreter Host
I1004 19:30:43.271768 22970125400960 xla_bridge.py:231] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
W1004 19:30:43.271850 22970125400960 xla_bridge.py:234] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
/home/fs01/dm852/venv/test/lib64/python3.6/site-packages/jax/lib/xla_bridge.py:374: UserWarning: jax.host_id has been renamed to jax.process_index. This alias will eventually be removed; please update your code.
  "jax.host_id has been renamed to jax.process_index. This alias "
I1004 19:31:40.582314 22970125400960 checkpoints.py:223] Found no checkpoint files in /home/fs01/dm852/venv/pt_tomography/score_inverse_problems-main/checkpoints-meta
I1004 19:31:40.584645 22970125400960 dataset_info.py:365] Load dataset info from /home/fs01/dm852/tensorflow_datasets/pt3701_512/1.0.0
I1004 19:31:40.587075 22970125400960 dataset_builder.py:351] Reusing dataset pt3701_512 (/home/fs01/dm852/tensorflow_datasets/pt3701_512/1.0.0)
I1004 19:31:40.587217 22970125400960 logging_logger.py:34] Constructing tf.data.Dataset pt3701_512 for split train[:80%], from /home/fs01/dm852/tensorflow_datasets/pt3701_512/1.0.0
I1004 19:31:40.803931 22970125400960 dataset_builder.py:351] Reusing dataset pt3701_512 (/home/fs01/dm852/tensorflow_datasets/pt3701_512/1.0.0)
I1004 19:31:40.804249 22970125400960 logging_logger.py:34] Constructing tf.data.Dataset pt3701_512 for split train[80%:90%], from /home/fs01/dm852/tensorflow_datasets/pt3701_512/1.0.0
I1004 19:31:41.548434 22970125400960 run_lib.py:149] Starting training loop at step 0.
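The "No GPU/TPU found, falling back to CPU" message in a log like this means JAX could not find a CUDA or TPU backend at startup. A quick way to confirm what an environment actually sees, before launching training, might look like the sketch below. It is not part of the repository: the helper name `describe_jax_backend` is made up for illustration, and it only relies on `jax.devices()` and the version attributes of `jax` and `jaxlib`.

```python
import importlib
import sys


def describe_jax_backend():
    """Return a small report on the JAX install and the devices it can see.

    Hypothetical helper for debugging; not part of the repository.
    """
    report = {
        "python": sys.version.split()[0],
        "jax": None,
        "jaxlib": None,
        "platform": None,
        "devices": [],
    }
    try:
        jax = importlib.import_module("jax")
    except ImportError:
        # JAX is not installed in this environment; return the empty report.
        return report
    report["jax"] = getattr(jax, "__version__", None)
    try:
        jaxlib_version = importlib.import_module("jaxlib.version")
        report["jaxlib"] = getattr(jaxlib_version, "__version__", None)
    except ImportError:
        pass
    # jax.devices() lists the devices of the default backend.
    devices = jax.devices()
    report["devices"] = [str(d) for d in devices]
    # The platform string is "cpu", "gpu" or "tpu".
    report["platform"] = devices[0].platform
    return report


if __name__ == "__main__":
    for key, value in describe_jax_backend().items():
        print(f"{key}: {value}")
```

If `platform` comes back as `cpu` on a machine that has a GPU, the installed jaxlib is most likely a CPU-only build or does not match the local CUDA version, which would be consistent with the `Could not find registered platform with name: "cuda"` line.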
Hi,
I have tried running the code on CPU (my setup is with Python 3.9, Ubuntu 16.04 on an 8-core machine), and I have had a segmentation fault:
Have you tried running the code on CPU, or is the code GPU-only?