Training workflow - Githubissues

eacharya commented 7 years ago

Objective

The end goal is to understand and debug the end-to-end training of a brand new model. Since this touches a core part of our business, it involves a variety of modules that need to be orchestrated:

Kheer: The pointer store for models (including tree relationship pointers) and human workflow - a part of Rasbari
Chia: The model creation engine that interfaces with caffe
Nimki: The link between kheer and chia

Training procedure

Models are found in 3 flavors:

Major: This encompasses a new domain (e.g., basketball logo)
Minor: This defines changes to the number of classes
Mini: This defines fine-tuning of the same number of classes

Models are defined in a tree structure with major as the root, minor as the branches and mini as the leaves. To reduce branching in the tree, a mini series must be completed prior to starting a minor model. The tree structure thus provides a linear ancestry. For each minor model, the number of detectables changes - for caffe to be happy with this change in model definition, there is a special flag which gets sent to Chia.
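As a rough illustration of the three flavors, the sketch below classifies a model by its version string. It assumes versions follow a `major.minor.mini` pattern, inferred from the 1.0.0 / 1.1.0 / 1.1.1 examples quoted later in this thread; the `ModelVersion` class and its methods are hypothetical and not part of Kheer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical helper: a version of the assumed form major.minor.mini."""
    major: int
    minor: int
    mini: int

    @classmethod
    def parse(cls, text):
        # "1.1.3" -> ModelVersion(1, 1, 3)
        major, minor, mini = (int(part) for part in text.split("."))
        return cls(major, minor, mini)

    def flavor(self):
        # Assumption: x.0.0 is a major, x.y.0 is a minor, anything else is a mini.
        if self.minor == 0 and self.mini == 0:
            return "major"
        if self.mini == 0:
            return "minor"
        return "mini"

    def next_mini(self):
        # Fine-tuning with the same number of classes.
        return ModelVersion(self.major, self.minor, self.mini + 1)

    def next_minor(self):
        # A change in the number of classes starts a new branch at mini 0.
        return ModelVersion(self.major, self.minor + 1, 0)

    def __str__(self):
        return "{}.{}.{}".format(self.major, self.minor, self.mini)
```

Under that assumption, `ModelVersion.parse("1.1.3").next_mini()` gives 1.1.4 and `next_minor()` gives 1.2.0. The rule that a mini series must be complete before a minor starts is not enforced here.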

Model training consists of running a set of examples through a model architecture - fine-tuning the existing weights. By the process of repeated fine-tuning across a set of mini models, we refine our model to high accuracy.

The examples for fine-tuning are generated by a human through a mining workflow. For each training phase, they are collected and fed to Chia through an iteration workflow. As part of this workflow, database entries are saved to a file and transferred to the storage server, and GPU training is kicked off. Once the training completes, the model is saved on the storage server and the workflow is complete. As noted above, the part that glues Chia and Kheer is Nimki.
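The iteration steps described above can be pictured as an ordered pipeline that stops at the first failure. This is only a sketch of that ordering: the `Step` names and the `run_iteration` function are hypothetical, and the real workflow in Kheer and Nimki is driven by messages rather than direct calls.

```python
from enum import Enum


class Step(Enum):
    # Hypothetical step names, in the order described in the text.
    EXPORT = "save database entries to a file"
    TRANSFER = "transfer build inputs to the storage server"
    TRAIN = "run GPU training"
    STORE = "save the trained model on the storage server"


def run_iteration(handlers):
    """Run each step in order and stop at the first one that fails.

    handlers maps each Step to a zero-argument callable returning True on
    success. Returns (completed_steps, failed_step); failed_step is None
    when every step succeeded.
    """
    completed = []
    for step in Step:  # Enum members iterate in definition order
        if not handlers[step]():
            return completed, step
        completed.append(step)
    return completed, None
```

Returning the failed step makes it easy to report which stage an iteration stopped at.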

Nimki is a Rabbit abstraction that is written in ruby (for now) and which uses the messaging engine found in Rasbari. As an exercise in understanding Nimki, it is recommended to start up various servers in Nimki and use the rails console to send messages to these servers.

Recommended Tasks:

Install RabbitMQ on the dev machine and run Nimki servers and Kheer clients (from rails console)
Follow the Kheer workflow of iteration creation up to Nimki ping state
Boot GPU machine and re-do workflow to do full training

Please add/remove as you work your way through the iteration.

arpgh commented 7 years ago

Deepak is also added to Nimki repo.

A new dev VM (azvm11) is now up to run two ends of Rabbit as described above. Same login format as azvm10 works for azvm11. IP address for azvm11 is 52.165.189.122.

DeepakZigvu commented 7 years ago

Confusion Finder: When localization is not set and 'Next' is pressed, the JSON that is sent causes an error due to missing current_filters.

DeepakZigvu commented 7 years ago

NoMethodError in Analysis::Minings::SequenceViewerWorkflowController#show

undefined method `detectable_ids' for nil:NilClass

Extracted source (around line #17):

    chiaModel = Kheer::ChiaModel.find(@mining.chia_model_id_loc)
    @detectables = Kheer::Detectable.where(id: chiaModel.detectable_ids)
    **@selectedDetectableIds = @mining.md_sequence_viewer.detectable_ids || []**
  end

  def handle(params)

DeepakZigvu commented 7 years ago

Ping from AZVM10 using rails console is working fine.

    Kheer::Iteration.find("56f380daa287f2ee8f000000").storageClient.isRemoteAlive?
    D, [2017-03-21T03:54:33.864935 #55038] DEBUG -- : MONGODB | localhost:27017 | rasbari_development.find | STARTED | {"find"=>"kheer_iterations", "filter"=>{"_id"=>BSON::ObjectId('56f380daa287f2ee8f000000')}}
    D, [2017-03-21T03:54:33.867159 #55038] DEBUG -- : MONGODB | localhost:27017 | rasbari_development.find | SUCCEEDED | 0.002097426s
    Setting::Machine Load (0.4ms) SELECT setting_machines.* FROM setting_machines WHERE setting_machines.id = 1 LIMIT 1
    => [true, "Ping successful"]

eacharya commented 7 years ago

Comment: Explanation

Terminology clarification:

Localization - What khajuri gives us after running a chia model across a frame
Annotation - What a human has created (either manually using the UI or through validating the systems' output)

In the build process, we collect all annotations for one or more clip/video/capture etc. and package them for khajuri.

Not clear on what the above few comments refer to - if you are trying to create a mining workflow, localization is needed for the selected videos. Also, if workflow creation is abandoned midway, it might complain, since probably not all corner cases have been taken care of. The expectation is that the user will create a workflow in one sitting. If you have found a new bug or have questions on the mining workflow, let us start a new issue - let us keep this issue for the training workflow.

If you are unclear about mining workflow vs. training workflow, perhaps Amrit can run you through a Skype screen share session to show how mining currently works on the production machine. (Dev machines may not have all videos since they tend to be fairly large.)

Finally, it might be helpful to label each comment in this issue with a tag. Example: Documentation, question, explanation, bug. That way, it is easier for me to know when you are expecting a reply/action from me vs. just documenting.

DeepakZigvu commented 7 years ago

List mining to edit: http://52.165.165.204:3000/analysis/minings/58b1dd8b4e01704752000001/edit Updated Type to SequenceViewer.
Next to http://52.165.165.204:3000/analysis/minings/58b1dd8b4e01704752000001/sequence_viewer_workflow/set_chia_models
- Localization (Major/Minor/Mini Chia Model): 1.0.0, 1.1.0, 1.1.1 respectively.
- Annotation (Major/Minor/Mini Chia Model): 1.0.0, 1.1.0, 1.1.1 respectively.
Next to http://52.165.165.204:3000/analysis/minings/58b1dd8b4e01704752000001/sequence_viewer_workflow/set_clips
- WorldCup 2014 -> Stitched video set 0 (preselected).
Next to http://52.165.165.204:3000/analysis/minings/58b1dd8b4e01704752000001/sequence_viewer_workflow/set_detectables
NoMethodError in Analysis::Minings::SequenceViewerWorkflowController#show

DeepakZigvu commented 7 years ago

With the storage server running on AZVM11 (without fake sftp), transfer of build_inputs.tar.gz has been verified at /data/azvm11/chia_models/7.

Message on http://52.165.165.204:3000/kheer/iterations/572c478aa287f2b57b000000/workflow/ping_nimki: GPU remote is alive but couldn't set model build details

eacharya commented 7 years ago

Documentation

Updated the database to sync with the latest production. Put the videos in the right folders and created a symlink so that the video can be seen in a mining workflow:

http://52.165.165.204:3000/analysis/minings/5785d2d6a287f23dbe000000

arpgh commented 7 years ago

Discussion (Continuing from Nimki issue#5)

From @eacharya

@DeepakZigvu - It is a good idea to include an avoid detectable to the chia model since our mining workflow uses that. We already have a __AVOID__ detectable - you'll need to add that to the chia model.

@arpgh - Do you see any use case where a chia model would not have either an avoid or any background classes?

Also, this discussion can better belong to the rasbari repo instead of the nimki repo.

I don't see any use case like that. In fact, I think we'll need more variations of avoid or background at some point. So we need to include them here in the detectables before training.

@DeepakZigvu please keep all issues/comments related to current work on the training workflow in this thread, and tag each comment at the top as we've done.

DeepakZigvu commented 7 years ago

Discussion

@eacharya Which interface allows adding those detectables to a new chia model? For the chia model TestNewChia (Version 3.0.0), #Annotations is 0 as well.

arpgh commented 7 years ago

You'll need annotations to do the training. # Annotations for 3.0.0 is 0 as shown. You should use Mini v1.1.3 which is left at Configuring state and see if you can continue the training workflow from there. Once you click show on that, you'll also see detectable list with # Annotations for each detectable.

DeepakZigvu commented 7 years ago

Discussion

@eacharya, @arpgh After running the workflow, an error message was shown. Seems like a module issue for SecureRandom.

workflow_nogpu

Console output for Samosa in AZVM11

D, [2017-04-07 01:30:13#37196] DEBUG -- : samosa_handler.rb: Served header : {:type=>"status", :state=>"failure"}
D, [2017-04-07 01:30:13#37196] DEBUG -- : samosa_handler.rb: Served message: {"category":"general","name":"none","trace":"Error: /home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/rpc_client.rb:43:in `call'"}
D, [2017-04-07 01:31:06#37196] DEBUG -- : samosa_handler.rb: Request header : {:type=>"ping", :state=>"request"}
D, [2017-04-07 01:31:06#37196] DEBUG -- : samosa_handler.rb: Request message: {"category":"general","name":"none","trace":""}
D, [2017-04-07 01:31:06#37196] DEBUG -- : samosa_handler.rb: Served header : {:type=>"ping", :state=>"success"}
D, [2017-04-07 01:31:06#37196] DEBUG -- : samosa_handler.rb: Served message: {"category":"general","name":"none","trace":""}
D, [2017-04-07 01:31:09#37196] DEBUG -- : samosa_handler.rb: Request header : {:type=>"data", :state=>"request"}
D, [2017-04-07 01:31:09#37196] DEBUG -- : samosa_handler.rb: Request message: {"iterationId":"572c478aa287f2b57b000000","chiaModelId":"7","parentChiaModelId":"4","needsTempParent":"false","storageHostname":"azvm11","storageBuildInputPath":"/data/azvm11/chia_models/7/build_inputs.tar.gz","storageModelPath":"","storageParentModelPath":"/data/azvm11/chia_models/4/zf_faster_rcnn_iter_20000.caffemodel","modelBuildPath":"","category":"samosa","name":"chia_details","trace":""}
E, [2017-04-07 01:31:09#37196] ERROR -- : samosa_handler.rb: uninitialized constant Messaging::Connections::RpcClient::SecureRandom
Did you mean?  SecurityError (NameError)
/home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/rpc_client.rb:43:in `call'
/home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/generic_client.rb:25:in `call'
/home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/generic_client.rb:34:in `isRemoteAlive?'
/home/ubuntu/nimki/servers/samosa/handlers/chia_details.rb:16:in `handle'
/home/ubuntu/nimki/servers/samosa/handlers/samosa_handler.rb:29:in `call'
/home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/rpc_server.rb:21:in `block in start'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/consumer.rb:56:in `call'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/channel.rb:1722:in `block in handle_frameset'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/consumer_work_pool.rb:97:in `block (2 levels) in run_loop'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/consumer_work_pool.rb:92:in `loop'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/consumer_work_pool.rb:92:in `block in run_loop'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/consumer_work_pool.rb:91:in `catch'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/bunny-2.3.0/lib/bunny/consumer_work_pool.rb:91:in `run_loop'
D, [2017-04-07 01:31:09#37196] DEBUG -- : samosa_handler.rb: Served header : {:type=>"status", :state=>"failure"}
D, [2017-04-07 01:31:09#37196] DEBUG -- : samosa_handler.rb: Served message: {"category":"general","name":"none","trace":"Error: /home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/rpc_client.rb:43:in `call'"}

DeepakZigvu commented 7 years ago

Discussion

@eacharya, @arpgh : Test2ChiaModel/Test2ChiaModelMini2 is in building state.

require 'securerandom'

is added to /home/ubuntu/rasbari/engines/messaging/lib/messaging/connections/rpc_client.rb in AZVM11. Build State: downloading, Build Progress: 0%

arpgh commented 7 years ago

Question

For Test2ChiaModel v4.1.1, # Annotations is showing 0, so that could be causing the failure with that. For v1.1.3, Build State shows Building but progress shows Failed. With fakeGpu on, what is expected behavior here? Does it return anything?

DeepakZigvu commented 7 years ago

Discussion: Issue for training due to error

@eacharya , While running the training, the following error is encountered.

clip_ids.json
zigvu_config_train.json
zigvu_config_train_parent.json
2.2.1 :001 >
2.2.1 :002 >   D, [2017-04-22 04:14:54#2444] DEBUG -- : build_manager.rb: Extract frames
D, [2017-04-22 04:14:54#2444] DEBUG -- : samosa_state.rb: Changing state to: extracting - 0%
I, [2017-04-22 04:14:54#2444] INFO -- : file_manager.rb: System: /home/ubuntu/samosa/tools/bin/extract_frames_from_video.py --video_path /tmp/chia/5785408ca287f2e790000002/clips/3166.mp4 --frame_numbers /tmp/chia/5785408ca287f2e790000002/build_data/3166/frame_numbers.txt --output_path /tmp/chia/5785408ca287f2e790000002/build_data/3166
DEBUG:root:Start extracting frames from /tmp/chia/5785408ca287f2e790000002/clips/3166.mp4
DEBUG:root:Extracting at 1 FPS
Traceback (most recent call last):
  File "/home/ubuntu/samosa/tools/bin/extract_frames_from_video.py", line 36, in <module>
    fnFileMap = frameExtractor.extract_based_on_file(args.frame_numbers_file)
  File "/home/ubuntu/samosa/tools/frames/frame_extractor.py", line 30, in extract_based_on_file
    return self.extract_non_sequential(frameNumbers)
  File "/home/ubuntu/samosa/tools/frames/frame_extractor.py", line 35, in extract_non_sequential
    self._frame_extract()
  File "/home/ubuntu/samosa/tools/frames/frame_extractor.py", line 54, in _frame_extract
    "{}/%04d.png".format(self.tempFramesPath)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ffmpeg', '-i', '/tmp/chia/5785408ca287f2e790000002/clips/3166.mp4', '-loglevel', 'panic', '-vf', "select='not(mod(n\\,1))'", '-f', 'image2', '-q:v', '0', '-vsync', '0', '/tmp/chia/5785408ca287f2e790000002/build_data/3166/temp_frames/%04d.png']' returned non-zero exit status 1
D, [2017-04-22 04:14:59#2444] DEBUG -- : samosa_state.rb: Changing state to: failed - 0%

For line 54 "{}/%04d.png".format(self.tempFramesPath) in frame_extractor.py, it is being resolved to /tmp/chia/5785408ca287f2e790000002/build_data/3166/temp_frames/%04d.png. Does ffmpeg automatically convert %04d.png to the correct filename?

eacharya commented 7 years ago

Can you verify that the file exists? If it does, can we check that ffmpeg runs outside of python - splat the args and run it from the command line.

For all external python calls, they should be executable stand-alone.
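One way to run the failing call stand-alone is to rebuild the argument list from the traceback and print it as a shell command. The sketch below copies the arguments from the `CalledProcessError` above and only changes `-loglevel` from `panic` to `debug` so ffmpeg reports why it failed. The function name is hypothetical, and the sketch targets Python 3.8+ (for `shlex.join`) rather than the Python 2.7 the tool runs under. The `%04d` in the output path is not a literal filename: ffmpeg's image2 muxer replaces it with a zero-padded frame sequence number.

```python
import shlex


def ffmpeg_extract_command(video_path, frames_dir, loglevel="debug"):
    """Rebuild the ffmpeg argument list seen in the traceback.

    Hypothetical helper: the arguments mirror the failing command, with
    the log level raised so ffmpeg prints a descriptive error.
    """
    return [
        "ffmpeg", "-i", video_path,
        "-loglevel", loglevel,
        "-vf", "select='not(mod(n\\,1))'",
        "-f", "image2", "-q:v", "0", "-vsync", "0",
        # image2 expands %04d to 0001, 0002, ... for each written frame
        "{}/%04d.png".format(frames_dir),
    ]


if __name__ == "__main__":
    cmd = ffmpeg_extract_command(
        "/tmp/chia/5785408ca287f2e790000002/clips/3166.mp4",
        "/tmp/chia/5785408ca287f2e790000002/build_data/3166/temp_frames",
    )
    # Print a properly quoted line that can be pasted into a shell.
    print(shlex.join(cmd))
```

Pasting the printed line into a shell on the GPU machine reproduces the call without the Python wrapper.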

DeepakZigvu commented 7 years ago

@eacharya There is no file named %04d.png. There are other png files with 4-digit filenames though. The filename being referenced is /tmp/chia/5785408ca287f2e790000002/clips/3166.mp4, which does exist.

eacharya commented 7 years ago

If the file exists and you are able to see some PNG files, my hunch is there might be something else going on. For example, you might be out of disk space. To find out, run the command from the command line with the ffmpeg log level set to debug; you should get a more descriptive error.

DeepakZigvu commented 7 years ago

@eacharya The ffmpeg failure seems to have been a disk space issue. It worked fine after adding storage for /tmp/chia.
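Since the failure came down to a full disk, a free-space check before frame extraction would surface the cause directly instead of as an opaque ffmpeg exit status. A minimal sketch using the standard library follows; the function name and the threshold are hypothetical, and how much space a build actually needs is not stated in this thread.

```python
import shutil
import tempfile


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has enough free space.

    Hypothetical pre-extraction check; required_bytes is a caller-chosen
    threshold, not a value taken from the training workflow.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= required_bytes


if __name__ == "__main__":
    # Example: require 1 GiB free in the system temp directory.
    print(has_free_space(tempfile.gettempdir(), 1024 ** 3))
```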

arpgh commented 7 years ago

After the storage fix, 2K iteration and build update in Rasbari looked fine. Here are some points from the call:

Any restart required after a new GPU vm to set host names or join rabbitmq cluster can be problematic
Why use sftp and not use rabbitmq itself for data download? Is that better managed bandwidth since control and data will run in different channels?
Which errors need to float up to Rasbari UI?
Put checks in training workflow, e.g. annotation count > 0 (issue# 9), what else?
What is next - capture workflow?
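The annotation-count check raised in the points above could look roughly like this. It is a sketch under assumptions: `preflight_errors` is a hypothetical name, the input is a plain mapping from detectable name to annotation count, and only the total is checked because the thread does not say whether every detectable needs annotations of its own.

```python
def preflight_errors(annotation_counts):
    """Return a list of reasons a training iteration should not start.

    annotation_counts maps detectable name -> number of annotations.
    An empty list means the (assumed) checks passed.
    """
    errors = []
    if not annotation_counts:
        errors.append("model has no detectables")
    elif sum(annotation_counts.values()) <= 0:
        errors.append("annotation count is 0")
    return errors
```

Returning messages rather than raising lets the caller decide which errors should float up to the Rasbari UI.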

eacharya commented 7 years ago

@DeepakZigvu - if training iteration works as expected, let us close this issue:

Any code change needed for this iteration to work needs to be committed to the issue8 branch. That branch then needs to be merged with development.
For any improvement that is not necessary for issue8, let us create a new issue and track code changes there.

If more tests need to be done, let us capture the required tests here and track the results.

DeepakZigvu commented 7 years ago

Closing the issue as the recommended tasks are complete.

zigvu / rasbari

Training workflow #8
