yardencsGitHub / tweetynet

Hybrid convolutional-recurrent neural networks for segmentation of birdsong and classification of elements

termination message #221

Closed. abscodeice closed this issue 6 months ago.

abscodeice commented 10 months ago

Hi guys, I'm having trouble interpreting this termination message. Can anyone tell me the most likely cause of it?

thanx, Amanda

Some info:
- Data: ultrasonic vocalizations (sample rate: 250 kHz; nfft: 1024; step: 480)
- Training set: 1200 s; validation set: 200 s; number of classes: 13 (including background)
- Most frequent class: 2228 calls; least frequent class: 58 calls

2023-12-05 20:03:49,155 - vak.engine.model - INFO - Step 840 is a validation step; computing metrics on validation set
2023-12-05 20:07:15,948 - vak.engine.model - INFO - avg_acc: 0.7111, avg_levenshtein: 1071.0000, avg_segment_error_rate: 0.8430, avg_loss: 0.6894
2023-12-05 20:07:15,948 - vak.engine.model - INFO - Accuracy has not improved in 6 validation steps. Not saving max-val-acc checkpoint for this validation step.
2023-12-05 20:12:03,529 - vak.engine.model - INFO - Step 880 is a validation step; computing metrics on validation set
2023-12-05 20:15:32,652 - vak.engine.model - INFO - avg_acc: 0.7023, avg_levenshtein: 1409.5000, avg_segment_error_rate: 1.1216, avg_loss: 0.7149
2023-12-05 20:15:32,652 - vak.engine.model - INFO - Stopping training early, accuracy has not improved in 6 validation steps.
2023-12-05 20:15:32,652 - vak.engine.model - INFO - Saving checkpoint at:
C:\Users\EthogenesisLab\Documents\Amanda\results\results_231205_170826\TweetyNet\checkpoints\checkpoint.pt
NickleDave commented 10 months ago

Hi @abscodeice thank you for raising a detailed issue, and I'm sorry that message isn't clearer.

I think you're wondering if it's an error message?
If so, it's not; it's just telling you that vak stopped training the model because the accuracy as measured on the validation set had not improved after six "validation steps". This is expected behavior.

Just so we're on the same page: a validation step is when vak pauses running training batches through the model (one global "step" is one training batch), computes metrics on the validation set with the current model, and then resumes training. If the accuracy on the validation set has improved, vak saves a checkpoint (with max-val-acc in the filename).

The frequency with which vak does this is controlled by the val_step option in the config file. E.g., for the experiments with Bengalese finch song in the paper, we set val_step = 400, meaning "every 400 steps / batches, stop and compute validation metrics": https://github.com/yardencsGitHub/tweetynet/blob/eab406f0590f4e90a36530cada52a78b0c676a80/article/data/configs/Bengalese_Finches/learncurve/revision/config_BFSongRepository_bl26lb16_learncurve.toml#L23
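For reference, here is a minimal sketch of how this looks in a vak TOML config; the values and section layout are illustrative, so please check the linked config file and the vak docs for a complete example (patience, discussed just below, lives in the same section):

```toml
# illustrative sketch, not a complete config
[TRAIN]
val_step = 400   # every 400 training steps/batches, compute metrics on the validation set
patience = 4     # stop early after this many validation steps with no improvement
```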

We have not yet tested extensively on mouse USVs (work in progress 🙂) but based on what we know from other animal vocalizations, I would guess that if the model has stopped improving after 6 validation steps, then that's probably as good as it's going to get.

The number of validation steps without any improvement that vak will run before stopping training early is controlled by the patience option in the config file, as defined here: https://vak.readthedocs.io/en/latest/reference/config.html#vak.config.train.TrainConfig.patience

For the Bengalese finch experiments in the paper we used patience = 4, and you're using 6, so I would guess that your model is probably pretty well trained: https://github.com/yardencsGitHub/tweetynet/blob/eab406f0590f4e90a36530cada52a78b0c676a80/article/data/configs/Bengalese_Finches/learncurve/revision/config_BFSongRepository_bl26lb16_learncurve.toml#L25

The only exception might be if you have a very small val_step. If you check too frequently, then at the start of training the validation accuracy still goes up and down a lot, so a small val_step combined with a relatively low patience can cause training to stop early when the model would still have been able to improve.
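To make that interaction concrete, here is a minimal sketch of the early-stopping logic in plain Python. This is not vak's actual implementation; train_one_batch, evaluate, and save_checkpoint are placeholder names for the real training, evaluation, and checkpointing steps:

```python
def train(train_batches, val_set, val_step, patience,
          train_one_batch, evaluate, save_checkpoint):
    """Sketch of early stopping with val_step + patience (not vak's actual code)."""
    best_val_acc = 0.0
    bad_val_steps = 0  # validation steps without improved accuracy
    step = 0

    for step, batch in enumerate(train_batches, start=1):
        train_one_batch(batch)  # one global "step" == one training batch

        if step % val_step == 0:  # this is a validation step
            val_acc = evaluate(val_set)
            if val_acc > best_val_acc:
                best_val_acc = val_acc
                bad_val_steps = 0
                save_checkpoint("max-val-acc-checkpoint.pt")
            else:
                bad_val_steps += 1
                if bad_val_steps >= patience:
                    # corresponds to the "Stopping training early ..." message in your log
                    save_checkpoint("checkpoint.pt")
                    return step
    return step
```

With a small val_step, those checks happen while validation accuracy is still noisy, so bad_val_steps can reach patience before the model has really converged; a larger val_step or a larger patience avoids that.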

If you want to be extra careful and troubleshoot this, you can do the following (in increasing order of complexity):

Please let me know if that helps! Happy to answer more questions, and even jump on a Zoom call to give you some tech support if you need it. We're glad to see your lab is still using TweetyNet + vak. You can also feel free to join our forum and ask questions there (to benefit from the hive mind 🙂): https://forum.vocalpy.org/


NickleDave commented 6 months ago

Going to close this -- just let us know if we can help you @abscodeice! You can also feel free to email me at nicholdav at gmail if you prefer.