wmayner / pyphi

A toolbox for integrated information theory.
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006343

Scalability to large network #115

Closed HireTheHero closed 9 months ago

HireTheHero commented 9 months ago

Hi, thanks for your great work. I'm running an IIT 4.0 analysis on a 12-node network (a 2^12 = 4096 state-by-state matrix) on a Google Colab Pro+ A100 GPU runtime. Please see this for the sample code. The code seems to be stuck here (sorry, I'm not sure which of the lines causes the problem). Is this expected (state matrix too large for computation), or do you think the problem lies in the matrix itself? Thanks!

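import pyphi

# `network`, `state`, and `node_labels` are defined earlier in the notebook.

# Subsystem built with backward_tpm=True
subsystem_backward = pyphi.Subsystem(
    network,
    state,
    nodes=node_labels,
    backward_tpm=True
)

# Same subsystem without the backward_tpm flag
subsystem_forward = pyphi.Subsystem(
    network,
    state,
    nodes=node_labels
)

wmayner commented 9 months ago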

Can you post the traceback?

HireTheHero commented 9 months ago

No traceback; it has just been running for days. I'll kill it and try different settings, but would be happy to hear any thoughts.

HireTheHero commented 9 months ago

Rerunning with pyphi.config.PROGRESS_BARS = True. I'll share the output.

HireTheHero commented 9 months ago

I reduced the network to 2^8 = 256 states and set pyphi.config.PARALLEL = True, and it looks like it will take 1-2 days to get a result. Is this expected, or is it taking too long? Any ideas about issues, speedups, and workarounds would be very helpful. Thanks!

2024-02-18 12:24:46,082 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
Evaluating partitions: 0it [00:00, ?it/s]
(raylet) WARNING: 512 PYTHON worker processes have been started on node: 93de8c32a26ba698aa82128f3e5af328164159419b78f9e5366cedcd with address: 150.65.182.83. This could result from using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
Evaluating partitions: 10it [00:14,  1.48s/it]
(raylet) WARNING: 785 PYTHON worker processes have been started on node: 93de8c32a26ba698aa82128f3e5af328164159419b78f9e5366cedcd with address: 150.65.182.83. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds). [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
Computing concepts:   0%|                                                                                                             | 0/255 [00:00<?, ?it/s]
(ProgressBarActor pid=780470) [2024-02-18 12:37:01,518 E 780470 780470] logging.cc:97: Unhandled exception: St12system_error. what(): Invalid argument
(ProgressBarActor pid=780470) [2024-02-18 12:37:01,555 E 780470 780470] logging.cc:104: Stack trace: 
(ProgressBarActor pid=780470)  /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(+0xfebb5a) [0x14779a159b5a] ray::operator<<()
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(+0xfee298) [0x14779a15c298] ray::TerminateHandler()
(ProgressBarActor pid=780470) /path/to/conda-env/bin/../lib/libstdc++.so.6(+0xb135a) [0x147798fe835a] __cxxabiv1::__terminate()
(ProgressBarActor pid=780470) /path/to/conda-env/bin/../lib/libstdc++.so.6(+0xb13c5) [0x147798fe83c5]
(ProgressBarActor pid=780470) /path/to/conda-env/bin/../lib/libstdc++.so.6(+0xb1658) [0x147798fe8658]
(ProgressBarActor pid=780470) /path/to/conda-env/bin/../lib/libstdc++.so.6(_ZSt20__throw_system_errori+0x85) [0x147798fdf5e8] std::__throw_system_error()
(ProgressBarActor pid=780470) /path/to/conda-env/bin/../lib/libstdc++.so.6(+0xdbe7e) [0x147799012e7e]
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(+0x7ce799) [0x14779993c799] ray::core::ConcurrencyGroupManager<>::Stop()
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core28CoreWorkerDirectTaskReceiver4StopEv+0x7b) [0x147799922a0b] ray::core::CoreWorkerDirectTaskReceiver::Stop()
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker8ShutdownEv+0x230) [0x1477998f5c30] ray::core::CoreWorker::Shutdown()
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(+0xa2864e) [0x147799b9664e] EventTracker::RecordExecution()
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(+0xa21a3e) [0x147799b8fa3e] std::_Function_handler<>::_M_invoke()
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(+0xa21eb6) [0x147799b8feb6] boost::asio::detail::completion_handler<>::do_complete()
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(+0x10d550b) [0x14779a24350b] boost::asio::detail::scheduler::do_run_one()
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(+0x10d6e89) [0x14779a244e89] boost::asio::detail::scheduler::run()
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(+0x10d7592) [0x14779a245592] boost::asio::io_context::run()
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv+0xcd) [0x1477998c805d] ray::core::CoreWorker::RunTaskExecutionLoop()
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x8c) [0x14779990a09c] ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop()
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv+0x1d) [0x14779990a24d] ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
(ProgressBarActor pid=780470) /path/to/conda-env/lib/python3.10/site-packages/ray/_raylet.so(+0x5ae657) [0x14779971c657] __pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loop()
(ProgressBarActor pid=780470) ray::ProgressBarActor() [0x5002c4] method_vectorcall_NOARGS
(ProgressBarActor pid=780470) ray::ProgressBarActor(_PyEval_EvalFrameDefault+0x731) [0x4ee071] _PyEval_EvalFrameDefault
(ProgressBarActor pid=780470) ray::ProgressBarActor(_PyFunction_Vectorcall+0x6f) [0x4fd90f] _PyFunction_Vectorcall
(ProgressBarActor pid=780470) ray::ProgressBarActor(_PyEval_EvalFrameDefault+0x731) [0x4ee071] _PyEval_EvalFrameDefault
(ProgressBarActor pid=780470) ray::ProgressBarActor() [0x595062] _PyEval_Vector
(ProgressBarActor pid=780470) ray::ProgressBarActor(PyEval_EvalCode+0x87) [0x594fa7] PyEval_EvalCode
(ProgressBarActor pid=780470) ray::ProgressBarActor() [0x5c5e17] run_eval_code_obj
(ProgressBarActor pid=780470) ray::ProgressBarActor() [0x5c0f60] run_mod
(ProgressBarActor pid=780470) ray::ProgressBarActor() [0x4595b6] pyrun_file.cold
(ProgressBarActor pid=780470) ray::ProgressBarActor(_PyRun_SimpleFileObject+0x19f) [0x5bb4ef] _PyRun_SimpleFileObject
(ProgressBarActor pid=780470) ray::ProgressBarActor(_PyRun_AnyFileObject+0x43) [0x5bb253] _PyRun_AnyFileObject
(ProgressBarActor pid=780470) ray::ProgressBarActor(Py_RunMain+0x38d) [0x5b800d] Py_RunMain
(ProgressBarActor pid=780470) ray::ProgressBarActor(Py_BytesMain+0x39) [0x588299] Py_BytesMain
(ProgressBarActor pid=780470) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x14779b9a70b3] __libc_start_main
(ProgressBarActor pid=780470) ray::ProgressBarActor() [0x58814e]
(ProgressBarActor pid=780470) 
(ProgressBarActor pid=780470) *** SIGABRT received at time=1708227421 on cpu 77 ***
(ProgressBarActor pid=780470) PC: @     0x14779b9c618b  (unknown)  raise
(ProgressBarActor pid=780470)     @     0x14779bce13c0  (unknown)  (unknown)
(ProgressBarActor pid=780470)     @     0x147798fe835a  (unknown)  __cxxabiv1::__terminate()
(ProgressBarActor pid=780470)     @     0x147798fe8580  (unknown)  (unknown)
(ProgressBarActor pid=780470) [2024-02-18 12:37:01,556 E 780470 780470] logging.cc:361: *** SIGABRT received at time=1708227421 on cpu 77 ***
(ProgressBarActor pid=780470) [2024-02-18 12:37:01,556 E 780470 780470] logging.cc:361: PC: @     0x14779b9c618b  (unknown)  raise
(ProgressBarActor pid=780470) [2024-02-18 12:37:01,556 E 780470 780470] logging.cc:361:     @     0x14779bce13c0  (unknown)  (unknown)
(ProgressBarActor pid=780470) [2024-02-18 12:37:01,556 E 780470 780470] logging.cc:361:     @     0x147798fe835a  (unknown)  __cxxabiv1::__terminate()
(ProgressBarActor pid=780470) [2024-02-18 12:37:01,556 E 780470 780470] logging.cc:361:     @     0x147798fe8580  (unknown)  (unknown)
(ProgressBarActor pid=780470) Fatal Python error: Aborted
(ProgressBarActor pid=780470) 
(ProgressBarActor pid=780470) Stack (most recent call first):
(ProgressBarActor pid=780470)   File "/path/to/conda-env/lib/python3.10/site-packages/ray/_private/worker.py", line 847 in main_loop
(ProgressBarActor pid=780470)   File "/path/to/conda-env/lib/python3.10/site-packages/ray/_private/workers/default_worker.py", line 282 in <module>
(ProgressBarActor pid=780470) 
(ProgressBarActor pid=780470) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pyarrow._hdfsio, pyarrow._json (total: 24)
Computing concepts:   0%|▍                                                                                                | 1/255 [08:33<36:12:56, 513.29s/it]
Computing concepts:   1%|▊                                                                                                | 2/255 [17:12<36:18:20, 516.60s/it]
(raylet) WARNING: 900 PYTHON worker processes have been started on node: 93de8c32a26ba698aa82128f3e5af328164159419b78f9e5366cedcd with address: 150.65.182.83. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
(raylet) WARNING: 1032 PYTHON worker processes have been started on node: 93de8c32a26ba698aa82128f3e5af328164159419b78f9e5366cedcd with address: 150.65.182.83. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
Computing concepts:   1%|█▏                                                                                               | 3/255 [32:28<48:56:28, 699.16s/it]
Computing concepts:   2%|█▌                                                                                               | 4/255 [39:04<40:23:49, 579.40s/it]
Computing concepts:   2%|█▉                                                                                               | 5/255 [40:01<27:09:34, 391.10s/it]
Computing concepts:   2%|██▎                                                                                              | 6/255 [43:19<22:30:52, 325.51s/it]
Computing concepts:   3%|██▋                                                                                              | 7/255 [51:32<26:11:08, 380.12s/it]
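
As a rough sanity check on the "1-2 days" estimate, the per-item rates reported by the progress bar in the log can be turned into a total-runtime range. This is only a back-of-the-envelope sketch: it assumes every one of the 255 items costs about the same as the reported rate, which the log itself suggests is not the case (the rate swings between roughly 325 and 700 s/it). The function name `total_hours` is hypothetical and not part of PyPhi.

```python
# Rates (seconds per item) copied from the "Computing concepts" progress bar.
reported_rates = [513.29, 516.60, 699.16, 579.40, 391.10, 325.51, 380.12]
TOTAL_ITEMS = 255  # total shown by the progress bar (x/255)


def total_hours(n_items: int, seconds_per_item: float) -> float:
    """Projected wall-clock hours if every item took `seconds_per_item`."""
    return n_items * seconds_per_item / 3600.0


# Best and worst case over the rates seen so far.
fastest = total_hours(TOTAL_ITEMS, min(reported_rates))
slowest = total_hours(TOTAL_ITEMS, max(reported_rates))
print(f"projected total: {fastest:.1f} to {slowest:.1f} hours")
```

Under that assumption the projection lands between about one and two days, which is consistent with the estimate quoted above.
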
isacdaavid commented 9 months ago

Hi. Anything above ~8 nodes is already pushing the limits of what can feasibly be done in terms of evaluating partitions, etc., although the devil is in the details. Many quantities and aspects of IIT involve combinatorial explosions, so any sufficiently big system can only be analyzed through approximations.
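
To make the combinatorial growth concrete, here is a minimal counting sketch for a network of binary nodes. It is not PyPhi's internal logic: the function `iit_search_space` is hypothetical, and the counts are simple upper bounds (every non-empty subset of nodes as a candidate mechanism, paired with every non-empty subset as a candidate purview, in both the cause and effect directions). The partition search performed for each pair is not counted at all, so the real cost grows faster than these numbers suggest.

```python
def iit_search_space(n_nodes: int) -> dict:
    """Upper-bound counts for a network of `n_nodes` binary nodes."""
    n_states = 2 ** n_nodes        # number of system states
    mechanisms = 2 ** n_nodes - 1  # non-empty subsets of nodes
    purviews = 2 ** n_nodes - 1    # non-empty subsets of nodes
    return {
        "states": n_states,
        # A state-by-state matrix has one row and one column per state.
        "tpm_entries": n_states ** 2,
        "mechanisms": mechanisms,
        # Factor of 2: cause and effect directions.
        "mechanism_purview_pairs": 2 * mechanisms * purviews,
    }


for n in (4, 8, 12):
    print(n, iit_search_space(n))
```

For 8 nodes this gives 255 candidate mechanisms, matching the `x/255` total in the progress bar above; for 12 nodes it gives 4095 mechanisms and a 4096 × 4096 state-by-state matrix.
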

The link you shared points to the IIT 4.0 demo, but there aren't any big networks in it, so I guess you shared the wrong link?

HireTheHero commented 9 months ago

Fair enough, thanks for your notes. The notebook uses the tiny network, but I'm running the same code with the larger one.

As for approximations and/or mitigating the combinatorial explosion, maybe I can write a feature request or PR. I'll take a look at your implementation. Thanks again!