zigvu / rasbari

Engine based Zigvu workflow tools.

SSH from storage server (az11) to GPU server (az12) stuck #13

Open DeepakZigvu opened 7 years ago

DeepakZigvu commented 7 years ago

As part of chia model evaluation, data transfer from the storage server to the GPU server seems to be stuck. Progress stays at 0%.

arpgh commented 7 years ago

This is a data copy issue for model training in the AWS dev VMs (so should this really be a reopening of issue #8?). Even though build_inputs.tar.gz is copied over correctly, the clips and base model are not. SSH login with keys is working OK. I checked the firewalls among the VMs and they look fine too. We had this working in Azure, so what are the possible causes of this hiccup?

DeepakZigvu commented 7 years ago

@eacharya This is the error displayed for the upload on az11:

E, [2017-06-30 01:41:56#26157] ERROR -- : file_transfer.rb: expected a file to upload (ArgumentError)

I could not upload the log file, so I am pasting it below.


2.3.1 :001 > I, [2017-06-30 01:28:40#26157] INFO -- : storage_server.rb: Start StorageServer
I, [2017-06-30 01:28:40#26157] INFO -- : storage_server.rb: Start StorageServer for hostname: az11
D, [2017-06-30 01:41:24#26157] DEBUG -- : storage_handler.rb: Request header : {:type=>"ping", :state=>"request"}
D, [2017-06-30 01:41:24#26157] DEBUG -- : storage_handler.rb: Request message: {"category":"general","name":"none","trace":""}
D, [2017-06-30 01:41:24#26157] DEBUG -- : storage_handler.rb: Served header : {:type=>"ping", :state=>"success"}
D, [2017-06-30 01:41:24#26157] DEBUG -- : storage_handler.rb: Served message: {"category":"general","name":"none","trace":""}
D, [2017-06-30 01:41:51#26157] DEBUG -- : storage_handler.rb: Request header : {:type=>"data", :state=>"request"}
D, [2017-06-30 01:41:51#26157] DEBUG -- : storage_handler.rb: Request message: {"hostname":"az10","type":"put","clientFilePath":"/tmp/iterations/5785408ca287f2e790000002/build_inputs.tar.gz","serverFilePath":"/data/az11/chia_models/10/build_inputs.tar.gz","category":"storage","name":"file_operations","trace":""}
File transfer GET az10 clientFilePath /tmp/iterations/5785408ca287f2e790000002/build_inputs.tar.gz serverFilePath /data/az11/chia_models/10/build_inputs.tar.gz
I, [2017-06-30 01:41:51#26157] INFO -- : sftp_connections_cache.rb: (re)Start SFTP connetion to az10
D, [2017-06-30 01:41:55#26157] DEBUG -- : storage_handler.rb: Served header : {:type=>"data", :state=>"success"}
D, [2017-06-30 01:41:55#26157] DEBUG -- : storage_handler.rb: Served message: {"hostname":"az10","type":"put","clientFilePath":"/tmp/iterations/5785408ca287f2e790000002/build_inputs.tar.gz","serverFilePath":"/data/az11/chia_models/10/build_inputs.tar.gz","category":"storage","name":"file_operations","trace":"File operation successful: /data/az11/chia_models/10/build_inputs.tar.gz"}
D, [2017-06-30 01:41:56#26157] DEBUG -- : storage_handler.rb: Request header : {:type=>"ping", :state=>"request"}
D, [2017-06-30 01:41:56#26157] DEBUG -- : storage_handler.rb: Request message: {"category":"general","name":"none","trace":""}
D, [2017-06-30 01:41:56#26157] DEBUG -- : storage_handler.rb: Served header : {:type=>"ping", :state=>"success"}
D, [2017-06-30 01:41:56#26157] DEBUG -- : storage_handler.rb: Served message: {"category":"general","name":"none","trace":""}
D, [2017-06-30 01:41:56#26157] DEBUG -- : storage_handler.rb: Request header : {:type=>"data", :state=>"request"}
D, [2017-06-30 01:41:56#26157] DEBUG -- : storage_handler.rb: Request message: {"hostname":"az12","type":"get","clientFilePath":"/tmp/chia/5785408ca287f2e790000002/build_inputs.tar.gz","serverFilePath":"/data/az11/chia_models/10/build_inputs.tar.gz","category":"storage","name":"file_operations","trace":""}
File transfer PUT az12 clientFilePath /tmp/chia/5785408ca287f2e790000002/build_inputs.tar.gz serverFilePath /data/az11/chia_models/10/build_inputs.tar.gz
I, [2017-06-30 01:41:56#26157] INFO -- : sftp_connections_cache.rb: (re)Start SFTP connetion to az12
D, [2017-06-30 01:41:56#26157] DEBUG -- : storage_handler.rb: Served header : {:type=>"data", :state=>"success"}
D, [2017-06-30 01:41:56#26157] DEBUG -- : storage_handler.rb: Served message: {"hostname":"az12","type":"get","clientFilePath":"/tmp/chia/5785408ca287f2e790000002/build_inputs.tar.gz","serverFilePath":"/data/az11/chia_models/10/build_inputs.tar.gz","category":"storage","name":"file_operations","trace":"File operation successful: /data/az11/chia_models/10/build_inputs.tar.gz"}
D, [2017-06-30 01:41:56#26157] DEBUG -- : storage_handler.rb: Request header : {:type=>"data", :state=>"request"}
D, [2017-06-30 01:41:56#26157] DEBUG -- : storage_handler.rb: Request message: {"hostname":"az12","type":"get","clientFilePath":"/tmp/chia/5785408ca287f2e790000002/ZF.v2.caffemodel","serverFilePath":"/data/az11/chia_models/1/ZF.v2.caffemodel","category":"storage","name":"file_operations","trace":""}
File transfer PUT az12 clientFilePath /tmp/chia/5785408ca287f2e790000002/ZF.v2.caffemodel serverFilePath /data/az11/chia_models/1/ZF.v2.caffemodel
I, [2017-06-30 01:41:56#26157] INFO -- : sftp_connections_cache.rb: Stop SFTP connetion to az12
I, [2017-06-30 01:41:56#26157] INFO -- : sftp_connections_cache.rb: (re)Start SFTP connetion to az12
E, [2017-06-30 01:41:56#26157] ERROR -- : file_transfer.rb: expected a file to upload (ArgumentError)
/home/ubuntu/.rvm/gems/ruby-2.3.1/gems/net-sftp-2.1.2/lib/net/sftp/operations/upload.rb:176:in `initialize'
/home/ubuntu/.rvm/gems/ruby-2.3.1/gems/net-sftp-2.1.2/lib/net/sftp/session.rb:98:in `new'
/home/ubuntu/.rvm/gems/ruby-2.3.1/gems/net-sftp-2.1.2/lib/net/sftp/session.rb:98:in `upload'
/home/ubuntu/.rvm/gems/ruby-2.3.1/gems/net-sftp-2.1.2/lib/net/sftp/session.rb:103:in `upload!'
/home/ubuntu/nimki/servers/storage/connections/sftp_connections_cache.rb:48:in `upload!'
/home/ubuntu/nimki/servers/storage/connections/file_transfer.rb:20:in `block in put'
/home/ubuntu/nimki/servers/storage/connections/file_transfer.rb:39:in `traceWrapper'
/home/ubuntu/nimki/servers/storage/connections/file_transfer.rb:19:in `put'
/home/ubuntu/nimki/servers/storage/handlers/file_operations.rb:25:in `handle'
/home/ubuntu/nimki/servers/storage/handlers/storage_handler.rb:24:in `call'
/home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/rpc_server.rb:21:in `block in start'
/home/ubuntu/.rvm/gems/ruby-2.3.1/gems/bunny-2.2.2/lib/bunny/consumer.rb:56:in `call'
/home/ubuntu/.rvm/gems/ruby-2.3.1/gems/bunny-2.2.2/lib/bunny/channel.rb:1722:in `block in handle_frameset'
/home/ubuntu/.rvm/gems/ruby-2.3.1/gems/bunny-2.2.2/lib/bunny/consumer_work_pool.rb:94:in `block (2 levels) in run_loop'
/home/ubuntu/.rvm/gems/ruby-2.3.1/gems/bunny-2.2.2/lib/bunny/consumer_work_pool.rb:89:in `loop'
/home/ubuntu/.rvm/gems/ruby-2.3.1/gems/bunny-2.2.2/lib/bunny/consumer_work_pool.rb:89:in `block in run_loop'
/home/ubuntu/.rvm/gems/ruby-2.3.1/gems/bunny-2.2.2/lib/bunny/consumer_work_pool.rb:88:in `catch'
/home/ubuntu/.rvm/gems/ruby-2.3.1/gems/bunny-2.2.2/lib/bunny/consumer_work_pool.rb:88:in `run_loop'
D, [2017-06-30 01:41:56#26157] DEBUG -- : storage_handler.rb: Served header : {:type=>"data", :state=>"failure"}
D, [2017-06-30 01:41:56#26157] DEBUG -- : storage_handler.rb: Served message: {"hostname":"az12","type":"get","clientFilePath":"/tmp/chia/5785408ca287f2e790000002/ZF.v2.caffemodel","serverFilePath":"/data/az11/chia_models/1/ZF.v2.caffemodel","category":"storage","name":"file_operations","trace":"Error: /home/ubuntu/.rvm/gems/ruby-2.3.1/gems/net-sftp-2.1.2/lib/net/sftp/operations/upload.rb:176:in `initialize'"}
eacharya commented 7 years ago

@DeepakZigvu - my guess is that the file doesn't exist on the storage server:

ubuntu@az11:~$ ll /data/az11/chia_models/1/ZF.v2.caffemodel
ls: cannot access '/data/az11/chia_models/1/ZF.v2.caffemodel': No such file or directory
ubuntu@az11:~$ ll /data/az11/chia_models/
total 12
drwxrwxr-x 3 ubuntu ubuntu 4096 Jun 25 21:35 ./
drwxrwxr-x 4 ubuntu ubuntu 4096 Jun 25 21:35 ../
drwxrwxr-x 2 ubuntu ubuntu 4096 Jun 25 21:35 10/

What am I missing?

If that is the case, @arpgh - can we move the base model to the right location?
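The ArgumentError in the log above is raised inside net-sftp's upload code, apparently because the local path passed to upload! is not an existing file. A check before the upload would make that failure easier to read. The following is only a minimal sketch: the helper name `ensure_uploadable!` is hypothetical and is not part of nimki or net-sftp.

```ruby
# Hypothetical pre-flight check (the name is an assumption for illustration).
# Returns the path unchanged when it points at a regular file; otherwise
# raises with a message that says what is wrong with the path.
def ensure_uploadable!(server_file_path)
  return server_file_path if File.file?(server_file_path)

  reason = File.exist?(server_file_path) ? "is not a regular file" : "does not exist"
  raise ArgumentError, "cannot upload #{server_file_path}: #{reason}"
end
```

Calling this on serverFilePath before handing it to the SFTP session would turn the stack trace above into a single line naming the missing file.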

Also, @arpgh - we should increase the disk sizes in both VMs since df -h shows the disks almost full. If there is not enough space for clips transfer and frame extraction, the build will fail in non-intuitive ways.
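To make a nearly full disk fail loudly rather than in non-intuitive ways, the free space could be checked before a transfer starts. This is a rough sketch only: it shells out to `df -Pk` (POSIX output format, sizes in KiB), and the method names and the idea of a fixed required size are assumptions, not existing code.

```ruby
# Parse the output of `df -Pk <path>` and return the available space in KiB.
# POSIX columns: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on.
# Assumes the filesystem name contains no spaces.
def parse_df_available_kb(df_output)
  data_line = df_output.lines[1]
  raise ArgumentError, "unexpected df output" if data_line.nil?

  Integer(data_line.split[3])
end

# Ask df about the filesystem that holds `path`.
def available_kb(path)
  parse_df_available_kb(IO.popen(["df", "-Pk", path], &:read))
end

# True when at least `required_kb` KiB are free on the filesystem of `path`.
def enough_space?(path, required_kb)
  available_kb(path) >= required_kb
end
```

For example, `enough_space?("/data/az11", size_of_clips_kb)` could be evaluated before clips are transferred, where `size_of_clips_kb` is a placeholder for whatever estimate is available.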

DeepakZigvu commented 7 years ago

@eacharya Should the chia_model files be in az10 or az11, and should the directory be az10 or az11? Should they be copied from az10 to az11 as part of the process, or should they already be in az11?

arpgh commented 7 years ago

I've doubled the disk size in az11 but had to restart az11 to extend the partition. @DeepakZigvu you'll need to run the storage server in screen again. (Please help save disk space by deleting large redundant backup files such as tar.gz archives.)

Also, you had backed up the models, so moving them to the right locations in az11 should help. It looks like you had earlier modified things to serve them from azvm10, but by default the models are served from the storage server. Reviewing that location in the code should also clarify this for you.

DeepakZigvu commented 7 years ago

I have moved the models in azvm11 to az11. A few of the tar files in az11 have also been deleted.

I found that within chia_models/10 there was another subdirectory named 10. I have moved the model and tar file up one level so that the layout matches the other model directories.

After these changes, I don't see any errors on the consoles, but it is still stuck at 0%.

I have not changed any code to serve them from azvm10. The only change was made in the UI, for the server hostname and IP. The directories were only renamed to match the hostname, based on what they were named earlier.
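For reference, flattening a nested directory like this can be scripted. The sketch below is an illustration of the manual step described above, not code from the repository; the method name is hypothetical, and it refuses to overwrite anything that already exists one level up.

```ruby
require "fileutils"

# Move everything from <model_dir>/<nested_name> up into <model_dir>, then
# remove the emptied nested directory. Returns the list of new paths, or an
# empty list when there is no nested directory. Nothing is overwritten.
def flatten_nested_dir(model_dir, nested_name)
  nested = File.join(model_dir, nested_name)
  return [] unless File.directory?(nested)

  entries = (Dir.entries(nested) - %w[. ..]).sort
  clash = entries.find { |entry| File.exist?(File.join(model_dir, entry)) }
  raise ArgumentError, "refusing to overwrite #{File.join(model_dir, clash)}" if clash

  moved = entries.map do |entry|
    target = File.join(model_dir, entry)
    FileUtils.mv(File.join(nested, entry), target)
    target
  end
  Dir.rmdir(nested)
  moved
end
```

In this case the call would be `flatten_nested_dir("/data/az11/chia_models/10", "10")`.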

arpgh commented 7 years ago

The files now seem to be in place in az11 but are still not copied to the GPU VM (az12). Are RabbitMQ and the SFTP trigger working OK, then?
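One quick way to rule out basic connectivity is to test whether the relevant TCP ports accept connections. This is a minimal sketch using only Ruby's standard library. The method name is hypothetical, and the ports in the usage note are the usual defaults (22 for SSH/SFTP, 5672 for AMQP), which this deployment may or may not use.

```ruby
require "socket"

# Return true when a TCP connection to host:port succeeds within `timeout`
# seconds, false when it is refused, times out, or the host cannot be resolved.
def port_open?(host, port, timeout: 3)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError
  false
end
```

From az11, `port_open?("az12", 22)` would check SFTP reachability, and `port_open?(rabbit_host, 5672)` would check the broker, where `rabbit_host` is a placeholder for wherever RabbitMQ runs. An open port only shows that something is listening; it does not prove the RPC consumer is processing messages.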

DeepakZigvu commented 7 years ago

@arpgh @eacharya Running the ChiaModel more than once, i.e. after one successful test and changing the state back to configuring before the next run, does not complete. @samosaState is probably not being reset properly. Explicitly resetting @samosaState resulted in proper copying of the clips.
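The kind of failure described here can be illustrated with a toy example. This is not the actual ChiaModel code: the class, the state values, and the method names are all assumptions, and only the instance variable name @samosaState comes from this thread. The sketch shows how a run that only starts from an idle state silently does nothing on the second attempt unless the state is reset.

```ruby
# Hypothetical stand-in for the object that tracks a run.
class RunTracker
  attr_reader :progress

  def initialize
    reset!
  end

  # Clear all per-run state so that the next run starts from scratch.
  def reset!
    @samosaState = :idle
    @progress = 0
  end

  # Start a run only from the idle state. If a previous run left a stale
  # state behind, this returns false without copying anything.
  def start(clips)
    return false unless @samosaState == :idle

    @samosaState = :copying
    clips.each_with_index do |_clip, index|
      # The real copy would happen here; this only updates the percentage.
      @progress = ((index + 1) * 100) / clips.size
    end
    @samosaState = :done
    true
  end
end
```

In this sketch, a second `start` call returns false until `reset!` is called. If the real code follows a similar pattern, calling the reset when the model is changed back to configuring would be one place to fix it.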