remove saving of training sets from `cnn_bilstm.train_utils.train`?

yardencsGitHub / tweetynet

Hybrid convolutional-recurrent neural networks for segmentation of birdsong and classification of elements

BSD 3-Clause "New" or "Revised" License

remove saving of training sets from `cnn_bilstm.train_utils.train`? #13

Closed NickleDave closed 4 years ago

NickleDave commented 5 years ago

Currently, during training, `cnn_bilstm.train_utils.train` saves the reshaped spectrograms, as well as the reshaped and then scaled spectrograms, in the output folder. This is done to keep a record of the training data. However, this takes up a lot of disk space. Also, the spectrograms are currently saved this way even if they are not scaled, which is confusing. Using the `train_inds` saved in each subfolder, along with the scaler, should produce the same results every time and eliminate any need to save a separate copy of the training data after reshaping. To be sure this is the case, tests should be added that guarantee it.
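As a rough illustration of the kind of test meant here, the sketch below rebuilds a training subset from saved indices and scaler parameters, then checks that doing so twice gives identical data. It uses plain Python lists, and every name in it (`fit_scaler`, `rebuild_training_set`, the toy `spect`) is hypothetical and not taken from the `cnn_bilstm` code, which may select and scale data differently.

```python
import math


def fit_scaler(rows):
    """Return per-column (means, stds) for a list of equal-length rows.

    Hypothetical stand-in for whatever scaler the project saves.
    """
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(n_cols)]
    stds = []
    for c in range(n_cols):
        var = sum((r[c] - means[c]) ** 2 for r in rows) / n
        # Fall back to 1.0 so constant columns do not divide by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds


def rebuild_training_set(spect, train_inds, scaler=None):
    """Select rows by saved indices, then scale them if a scaler is given."""
    subset = [spect[i] for i in train_inds]
    if scaler is None:
        return subset
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in subset]


def test_rebuild_is_deterministic():
    # Toy "spectrogram": 10 rows of 4 values each.
    spect = [[float(i + j) for j in range(4)] for i in range(10)]
    train_inds = [7, 2, 5, 0]  # assumed to be what gets saved to disk
    scaler = fit_scaler([spect[i] for i in train_inds])

    first = rebuild_training_set(spect, train_inds, scaler)
    second = rebuild_training_set(spect, train_inds, scaler)

    # Same indices plus same scaler must give the same data every time.
    assert first == second
    # Without a scaler, the subset is just the original rows in saved order.
    assert rebuild_training_set(spect, train_inds) == [spect[i] for i in train_inds]
```

A real test would load the saved `train_inds` and scaler from a results subfolder and compare against the arrays actually fed to the network.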

NickleDave commented 5 years ago

Also, the train function makes copies of x_train, x_val, etc. in the main results directory it creates, which again takes up space.

NickleDave commented 5 years ago

One approach would be to make saving copies of the data an option for paranoid scientists like me, i.e.,

[TRAIN]
save_copies_of_data = True

but then have it set to False by default
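A minimal sketch of how such an option could be read with a default of False, using the standard library's `configparser`. The `[TRAIN]` section and `save_copies_of_data` name come from the proposal above; the helper `should_save_copies` is hypothetical and not part of the existing code.

```python
import configparser


def should_save_copies(config_text):
    """Return True only if the config explicitly asks for copies of the data."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    # fallback=False covers both a missing option and a missing [TRAIN] section.
    return config.getboolean("TRAIN", "save_copies_of_data", fallback=False)


if __name__ == "__main__":
    print(should_save_copies("[TRAIN]\nsave_copies_of_data = True\n"))
    print(should_save_copies("[TRAIN]\n"))
```

With the fallback, existing config files that never mention the option keep working and simply skip the copies.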

yardencsGitHub commented 5 years ago

I honestly don't understand why there's any need to create copies of the data. Once the .spect files are there and the config.ini file points to their location, all other operations just need pointers to the files. If need be, it may be useful to also save a normalized version of the spectrograms.

NickleDave commented 5 years ago

What if you want to share your data, including the exact data fed to the network, so people know they can reproduce your results?

Sure, you should only need pointers, as long as the programmer never makes any mistakes in reproducing the path from spectrograms to the data that the network ends up seeing, and/or you have written tests that verify that the program correctly traverses every possible path from spectrogram file to the batched data fed to the network (normalized or not, etc.).

Currently cnn_bilstm.train_utils.learning_curve uses the saved data instead of re-creating it. This avoids the possibility of calculating the wrong accuracy because some step between the spectrograms and feeding them to the network is or isn't carried out.

NickleDave commented 5 years ago

Also, if something were to happen to the .ini file (e.g., you accidentally edited it), you could verify from the saved data that it doesn't match what you get when you follow what's in the .ini.

yardencsGitHub commented 5 years ago

Agreed. But you can simply save the .ini file that was used. Then it is the programmer's responsibility not to take shortcuts. You can also create a very simple script that 'exports' data dicts when given a .ini file, in case you want someone else to replicate the results.
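Saving the .ini file that was used could look roughly like the sketch below, which copies the config into the results directory next to the other outputs. The function name `save_config_copy` and its arguments are hypothetical; how the real training code names and lays out its results directory is not specified here.

```python
import shutil
from pathlib import Path


def save_config_copy(config_path, results_dir):
    """Copy the config file used for a run into that run's results directory."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    dest = results_dir / Path(config_path).name
    # copy2 also preserves file metadata such as the modification time.
    shutil.copy2(config_path, dest)
    return dest
```

Because the copy lives with the results, a later accidental edit to the original .ini does not change the record of what was run.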

NickleDave commented 4 years ago

This functionality has been removed