Closed eacharya closed 7 years ago
Deepak is also added to Nimki repo.
A new dev VM (azvm11) is now up to run two ends of Rabbit as described above. Same login format as azvm10 works for azvm11. IP address for azvm11 is 52.165.189.122.
Confusion Finder: When localization is not set and 'Next' is pressed, the JSON that is sent causes an error because current_filters is missing.
NoMethodError in Analysis::Minings::SequenceViewerWorkflowController#show
undefined method `detectable_ids' for nil:NilClass
Extracted source (around line #17):
chiaModel = Kheer::ChiaModel.find(@mining.chia_model_id_loc)
@detectables = Kheer::Detectable.where(id: chiaModel.detectable_ids)
**@selectedDetectableIds = @mining.md_sequence_viewer.detectable_ids || []**
end
def handle(params)
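A minimal sketch of a nil guard for the highlighted line, assuming the error comes from `@mining.md_sequence_viewer` being nil when the workflow was left incomplete. The `Mining` and `SequenceViewer` structs and the `selected_detectable_ids` helper are hypothetical stand-ins, not the real Kheer/Rasbari classes; only the attribute names are taken from the trace.

```ruby
# Hypothetical stand-ins for the real records; only the attribute
# names (md_sequence_viewer, detectable_ids) come from the trace above.
Mining = Struct.new(:md_sequence_viewer)
SequenceViewer = Struct.new(:detectable_ids)

# Returns the selected detectable ids, or [] when the sequence viewer
# has not been created yet (the case that raised NoMethodError).
def selected_detectable_ids(mining)
  viewer = mining.md_sequence_viewer
  return [] if viewer.nil?
  viewer.detectable_ids || []
end

incomplete = Mining.new(nil)
complete   = Mining.new(SequenceViewer.new([3, 7]))

p selected_detectable_ids(incomplete)
p selected_detectable_ids(complete)
```

Whether returning an empty list is the right behavior, or whether the controller should redirect back to the localization step, is a design decision for the mining workflow.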
Ping from AZVM10 using rails console is working fine.
Kheer::Iteration.find("56f380daa287f2ee8f000000").storageClient.isRemoteAlive?
D, [2017-03-21T03:54:33.864935 #55038] DEBUG -- : MONGODB | localhost:27017 | rasbari_development.find | STARTED | \
{"find"=>"kheer_iterations", "filter"=>{"_id"=>BSON::ObjectId('56f380daa287f2ee8f000000')}}
D, [2017-03-21T03:54:33.867159 #55038] DEBUG -- : MONGODB | localhost:27017 | rasbari_development.find | SUCCEEDED \
| 0.002097426s
Setting::Machine Load (0.4ms) SELECT setting_machines.* FROM setting_machines WHERE setting_machines.id = 1 LIMIT 1
=> [true, "Ping successful"]
Comment: Explanation
Terminology clarification:
In the build process, we collect all annotations for one or more clip/video/capture etc. and package them for khajuri.
Not clear on what the above few comments refer to - if you are trying to create a mining workflow, localization is needed for the selected videos. Also, if a workflow creation is dropped in the middle of the creation process, it might complain, since probably not all corner cases have been taken care of. The expectation is that the user will create a workflow in one sitting. If you have found a new bug or have questions on the mining workflow, let us start a new issue - let us keep this issue for the training workflow.
If you are unclear about mining workflow vs. training workflow, perhaps Amrit can run you through a Skype screen share session to show how mining currently works in production machine. (Dev machines may not have all videos since they tend to be fairly large.)
Finally, it might be helpful to label each comment in this issue with a tag. Example: Documentation, question, explanation, bug. That way, it is easier for me to know when you are expecting a reply/action from me vs. just documenting.
With the storage server running on AZVM11 (without fake sftp), transfer of build_inputs.tar.gz has been verified at /data/azvm11/chia_models/7.
Message on http://52.165.165.204:3000/kheer/iterations/572c478aa287f2b57b000000/workflow/ping_nimki GPU remote is alive but couldn't set model build details
Documentation
Updated the database to sync with the latest production data. Put the videos in the right folders and created a symlink so that the video can be seen in a mining workflow:
http://52.165.165.204:3000/analysis/minings/5785d2d6a287f23dbe000000
Discussion
(Continuing from Nimki issue#5)
@DeepakZigvu - It is a good idea to include an avoid detectable to the chia model since our mining workflow uses that. We already have a __AVOID__ detectable - you'll need to add that to the chia model.
@arpgh - Do you see any use case where a chia model would not have either an avoid or any background classes?
Also, this discussion can better belong to the rasbari repo instead of the nimki repo.
I don't see any use case like that. In fact, I think we'll need more variations of avoid or background at some point. So we need to include them in the detectables before training.
@DeepakZigvu please keep all issues/comments related to current work in the training workflow in this thread. Also add a Tag at the top, as we've done.
@eacharya Which interface allows adding those detectable to a new chia model? For the chia model TestNewChia (Version 3.0.0), #Annotations is 0 as well.
You'll need annotations to do the training. # Annotations for 3.0.0 is 0 as shown. You should use Mini v1.1.3, which is left in the Configuring state, and see if you can continue the training workflow from there. Once you click show on that, you'll also see the detectable list with # Annotations for each detectable.
@eacharya, @arpgh After running the workflow, an error message was shown. It seems to be a module issue with SecureRandom.
Console output for Samosa in AZVM11
D, [2017-04-07 01:30:13#37196] DEBUG -- : samosa_handler.rb: Served header : {:type=>"status", :state=>"failure"}
D, [2017-04-07 01:30:13#37196] DEBUG -- : samosa_handler.rb: Served message: {"category":"general","name":"none","trace":"Error: /home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/rpc_client.rb:43:in `call'"}
D, [2017-04-07 01:31:06#37196] DEBUG -- : samosa_handler.rb: Request header : {:type=>"ping", :state=>"request"}
D, [2017-04-07 01:31:06#37196] DEBUG -- : samosa_handler.rb: Request message: {"category":"general","name":"none","trace":""}
D, [2017-04-07 01:31:06#37196] DEBUG -- : samosa_handler.rb: Served header : {:type=>"ping", :state=>"success"}
D, [2017-04-07 01:31:06#37196] DEBUG -- : samosa_handler.rb: Served message: {"category":"general","name":"none","trace":""}
D, [2017-04-07 01:31:09#37196] DEBUG -- : samosa_handler.rb: Request header : {:type=>"data", :state=>"request"}
D, [2017-04-07 01:31:09#37196] DEBUG -- : samosa_handler.rb: Request message: {"iterationId":"572c478aa287f2b57b000000","chiaModelId":"7","parentChiaModelId":"4","needsTempParent":"false","storageHostname":"azvm11","storageBuildInputPath":"/data/azvm11/chia_models/7/build_inputs.tar.gz","storageModelPath":"","storageParentModelPath":"/data/azvm11/chia_models/4/zf_faster_rcnn_iter_20000.caffemodel","modelBuildPath":"","category":"samosa","name":"chia_details","trace":""}
E, [2017-04-07 01:31:09#37196] ERROR -- : samosa_handler.rb: uninitialized constant Messaging::Connections::RpcClient::SecureRandom
Did you mean? SecurityError (NameError)
/home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/rpc_client.rb:43:in `call'
/home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/generic_client.rb:25:in `call'
/home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/generic_client.rb:34:in `isRemoteAlive?'
/home/ubuntu/nimki/servers/samosa/handlers/chia_details.rb:16:in `handle'
/home/ubuntu/nimki/servers/samosa/handlers/samosa_handler.rb:29:in `call'
/home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/rpc_server.rb:21:in `block in start'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/consumer.rb:56:in `call'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/channel.rb:1722:in `block in handle_frameset'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/consumer_work_pool.rb:97:in `block (2 levels) in run_loop'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/consumer_work_pool.rb:92:in `loop'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/consumer_work_pool.rb:92:in `block in run_loop'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/consumer_work_pool.rb:91:in `catch'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/consumer_work_pool.rb:91:in `run_loop'
D, [2017-04-07 01:31:09#37196] DEBUG -- : samosa_handler.rb: Served header : {:type=>"status", :state=>"failure"}
D, [2017-04-07 01:31:09#37196] DEBUG -- : samosa_handler.rb: Served message: {"category":"general","name":"none","trace":"Error: /home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/rpc_client.rb:43:in `call'"}
@eacharya, @arpgh : Test2ChiaModel/Test2ChiaModelMini2 is in building state.
require 'securerandom' has been added to /home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/rpc_client.rb in AZVM11.
Build State: downloading Build Progress: 0%
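For context on the fix, a small sketch of why the explicit require matters: `SecureRandom` is only defined once the `securerandom` standard library has been loaded. What rpc_client.rb does with it at line 43 is not shown in this thread; the `new_correlation_id` helper below is hypothetical and assumes the common RPC pattern of tagging each request with a unique id.

```ruby
# Without this line, referencing SecureRandom raises
# "uninitialized constant ... SecureRandom (NameError)" in a plain Ruby process.
require 'securerandom'

# Hypothetical helper: a unique id for matching an RPC reply to its request.
def new_correlation_id
  SecureRandom.uuid
end

id = new_correlation_id
puts id
```

A Rails console may already have the library loaded through another dependency, which could explain why the ping worked from the console but not from the stand-alone Samosa server.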
Question
For Test2ChiaModel v4.1.1, # Annotations is showing 0, so that could be causing the failure with that. For v1.1.3, Build State shows Building but progress shows Failed. With fakeGpu on, what is expected behavior here? Does it return anything?
@eacharya , While running the training, the following error is encountered.
clip_ids.json
zigvu_config_train.json
zigvu_config_train_parent.json
2.2.1 :001 >
2.2.1 :002 > D, [2017-04-22 04:14:54#2444] DEBUG -- : build_manager.rb: Extract frames
D, [2017-04-22 04:14:54#2444] DEBUG -- : samosa_state.rb: Changing state to: extracting - 0%
I, [2017-04-22 04:14:54#2444] INFO -- : file_manager.rb: System: /home/ubuntu/samosa/tools/bin/extract_frames_from_video.py --video_path /tmp/chia/5785408ca287f2e790000002/clips/3166.mp4 --frame_numbers /tmp/chia/5785408ca287f2e790000002/build_data/3166/frame_numbers.txt --output_path /tmp/chia/5785408ca287f2e790000002/build_data/3166
DEBUG:root:Start extracting frames from /tmp/chia/5785408ca287f2e790000002/clips/3166.mp4
DEBUG:root:Extracting at 1 FPS
Traceback (most recent call last):
File "/home/ubuntu/samosa/tools/bin/extract_frames_from_video.py", line 36, in <module>
fnFileMap = frameExtractor.extract_based_on_file(args.frame_numbers_file)
File "/home/ubuntu/samosa/tools/frames/frame_extractor.py", line 30, in extract_based_on_file
return self.extract_non_sequential(frameNumbers)
File "/home/ubuntu/samosa/tools/frames/frame_extractor.py", line 35, in extract_non_sequential
self._frame_extract()
File "/home/ubuntu/samosa/tools/frames/frame_extractor.py", line 54, in _frame_extract
"{}/%04d.png".format(self.tempFramesPath)
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ffmpeg', '-i', '/tmp/chia/5785408ca287f2e790000002/clips/3166.mp4', '-loglevel', 'panic', '-vf', "select='not(mod(n\\,1))'", '-f', 'image2', '-q:v', '0', '-vsync', '0', '/tmp/chia/5785408ca287f2e790000002/build_data/3166/temp_frames/%04d.png']' returned non-zero exit status 1
D, [2017-04-22 04:14:59#2444] DEBUG -- : samosa_state.rb: Changing state to: failed - 0%
Line 54 of frame_extractor.py, "{}/%04d.png".format(self.tempFramesPath), resolves to /tmp/chia/5785408ca287f2e790000002/build_data/3166/temp_frames/%04d.png. Does ffmpeg automatically convert %04d.png to the correct filename?
Can you verify that the file exists? If it does, can we try running ffmpeg outside of Python - splat the args and run it from the command line.
All external Python calls should be executable stand-alone.
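On the %04d question: ffmpeg's image2 output treats `%04d` in the output path as a printf-style sequence pattern and writes zero-padded numbered files, so no file literally named %04d.png is expected. Ruby's `format` uses the same convention, which gives a quick way to see the names to look for. This is an illustration only; numbering is assumed to start at 1, which is ffmpeg's default.

```ruby
# The same printf-style pattern that ffmpeg receives as its output path.
pattern = '/tmp/chia/5785408ca287f2e790000002/build_data/3166/temp_frames/%04d.png'

# Expand the pattern for the first three frame numbers.
first_frames = (1..3).map { |n| format(pattern, n) }

puts first_frames.map { |path| File.basename(path) }
```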
@eacharya There is no file named %04d.png. There are other PNG files with 4-digit filenames though. The file being referenced is /tmp/chia/5785408ca287f2e790000002/clips/3166.mp4, which does exist.
If the file exists and you are able to see some PNG files, my hunch is there might be something else going on. For example, you might be out of disk space. To find out, run the command from the command line with the ffmpeg log level set to debug; you should get a more descriptive error.
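A sketch of how the stand-alone command could be assembled, using the argument list from the CalledProcessError earlier in this thread with the log level switched from panic to debug. Only the shell line is built and printed here; nothing is executed.

```ruby
require 'shellwords'

# Arguments copied from the CalledProcessError, with '-loglevel' changed
# from 'panic' to 'debug' so ffmpeg reports why it exits non-zero.
args = [
  'ffmpeg', '-i', '/tmp/chia/5785408ca287f2e790000002/clips/3166.mp4',
  '-loglevel', 'debug',
  '-vf', "select='not(mod(n\\,1))'",
  '-f', 'image2', '-q:v', '0', '-vsync', '0',
  '/tmp/chia/5785408ca287f2e790000002/build_data/3166/temp_frames/%04d.png'
]

# Shellwords.join escapes each argument so the line can be pasted into a shell.
command = Shellwords.join(args)
puts command
```

Running `df -h /tmp` before and after is a quick way to check the disk-space hunch.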
@eacharya The ffmpeg failure seems to have been a disk space issue. It worked fine after additional storage was added for /tmp/chia.
After the storage fix, 2K iteration and build update in Rasbari looked fine. Here are some points from the call:
@DeepakZigvu - if training iteration works as expected, let us close this issue:
If more tests need to be done, let us capture the required tests here and track the results.
Closing the issue as the recommended tasks are complete.
Objective
The end goal is to understand and debug the end-to-end training of a brand new model. Since this touches a core part of our business, it involves a variety of modules that need to be orchestrated:
caffe
Training procedure
Models are found in 3 flavors:
Models are defined in a tree structure with major as the root, minor as the branches and mini as the leaves. To reduce branching in the tree, a mini series must be completed prior to starting a minor model. The tree structure thus provides a linear ancestry. For each minor model, the number of detectables changes - for caffe to be happy with this change in model definition, there is a special flag which gets sent to Chia.
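A minimal sketch of the linear ancestry described above, assuming version strings of the form major.minor.mini (as in v1.1.3 and 3.0.0 elsewhere in this thread) and assuming that a zero component marks the start of a major or minor model. `ChiaVersion` is a hypothetical value object, not the real Kheer::ChiaModel.

```ruby
# Hypothetical value object; the real Kheer::ChiaModel fields may differ.
ChiaVersion = Struct.new(:major, :minor, :mini) do
  include Comparable

  def self.parse(text)
    new(*text.split('.').map { |part| Integer(part) })
  end

  # Compare component by component: major first, then minor, then mini.
  def <=>(other)
    to_a <=> other.to_a
  end

  # Assumption: trailing zero components identify major and minor models.
  def flavor
    return :major if minor.zero? && mini.zero?
    return :minor if mini.zero?
    :mini
  end

  def to_s
    to_a.join('.')
  end
end

versions = %w[1.1.3 1.0.0 1.1.0 1.1.1 1.1.2].map { |v| ChiaVersion.parse(v) }

# Sorting yields the linear ancestry: each model follows its predecessor.
lineage = versions.sort
puts lineage.map { |v| "#{v} (#{v.flavor})" }
```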
Model training consists of running a set of examples through a model architecture - fine-tuning the existing weights. By the process of repeated fine-tuning across a set of mini models, we refine our model to high accuracy.
The examples for fine-tuning are generated by a human through a mining workflow. For each training phase, they are collected and fed to Chia through an iteration workflow. As part of this workflow, database entries are saved to a file, transferred to the storage server and a GPU training is kicked off. Once the training completes, the model is saved in the storage server and the workflow is complete. As noted above, the part that glues Chia and Kheer is Nimki. Nimki is a Rabbit abstraction that is written in ruby (for now) and which uses the messaging engine found in Rasbari. As an exercise in understanding Nimki, it is recommended to start up various servers in Nimki and use the rails console to send messages to these servers.
Recommended Tasks:
Please add/remove as you work your way through the iteration.
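As a starting point for the console exercise, the header and message shapes that appear in the Samosa console output in this thread can be reproduced in plain Ruby. The `build_ping` helper is hypothetical and is not part of Nimki or the Rasbari messaging engine; it only mirrors the logged ping request.

```ruby
require 'json'

# Hypothetical helper mirroring the ping request seen in the Samosa log:
#   header : {:type=>"ping", :state=>"request"}
#   message: {"category":"general","name":"none","trace":""}
def build_ping
  header  = { type: 'ping', state: 'request' }
  message = { 'category' => 'general', 'name' => 'none', 'trace' => '' }
  [header, JSON.generate(message)]
end

header, body = build_ping
puts header.inspect
puts body
```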