Closed NickleDave closed 4 years ago
Also, the train function makes copies of x_train, x_val, etc. in the main results directory it creates, which again takes up space.
One approach would be to make saving copies of the data an option for paranoid scientists like me, i.e.,
[TRAIN]
save_copies_of_data = True
but then have it set to False by default.
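A minimal sketch of how such an option could be read, assuming the [TRAIN] section and the save_copies_of_data name from the example above. The helper should_save_copies is hypothetical and not part of the package; it only illustrates using Python's standard configparser with a fallback so that the option is off unless explicitly requested.

```python
import configparser


def should_save_copies(config_text):
    """Return True only if the config explicitly asks for copies of the data.

    Hypothetical helper: falls back to False when the section or option is missing.
    """
    config = configparser.ConfigParser()
    config.read_string(config_text)
    # getboolean accepts values such as True/False, yes/no, on/off, 1/0
    return config.getboolean("TRAIN", "save_copies_of_data", fallback=False)
```

With this approach an existing config.ini that never mentions the option keeps working and simply does not save copies.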
I honestly don't understand why there's any need to create copies of the data. Once the .spect files are there and the config.ini file points to their location, all other operations just need pointers to the files. If need be, it may be useful to also save a normalized version of the spectrograms.
What if you want to share your data, including the exact data fed to the network, so people know they can reproduce your results?
Sure, you should only need pointers, as long as the programmer never makes any mistakes in reproducing the path from spectrograms to the data that the network ends up seeing, and/or you have written tests that verify that the program correctly traverses every possible path from spectrogram file to the batched data fed to the network (normalized? etc.?).
Currently cnn_bilstm.train_utils.learning_curve uses the saved data instead of re-creating it.
This avoids the possibility of calculating the wrong accuracy because some step between the spectrograms and feeding them to the network is or isn't carried out.
Also, if something were to happen to the .ini file (e.g. you accidentally edited it), you can verify from the saved data that it no longer matches what you get when you follow what's in the .ini.
Agreed. But you can simply save the .ini file that was used. Then it is the programmer's responsibility not to take shortcuts. You can also create a very simple script that 'exports' data dicts when given a .ini file, in case you want someone else to replicate the results.
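As a rough illustration of "save the .ini file that was used", here is a sketch that copies the config into the results directory of a run. The function name save_config_copy and the directory layout are assumptions made for this example, not what the package actually does.

```python
import shutil
from pathlib import Path


def save_config_copy(config_path, results_dir):
    """Copy the .ini file used for a run into that run's results directory.

    Hypothetical helper: returns the path of the saved copy.
    """
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    destination = results_dir / Path(config_path).name
    # copy2 copies the file contents and also tries to preserve metadata
    shutil.copy2(config_path, destination)
    return destination
```

Later edits to the original .ini then leave the copy stored with the results untouched.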
This functionality has been removed
Currently, during training, cnn_bilstm.train_utils.train saves the reshaped spectrograms, as well as the reshaped and then scaled spectrograms, in the output folder. This is done to keep a record of the training data. However, this takes up a lot of disk space. Also, the spectrograms are currently saved this way even if they are not scaled, which is confusing. Using the train_inds that is saved in each subfolder, along with the scaler, should produce the same results every time and eliminate any need to save a separate copy of the training data after reshaping. To be sure this is the case, tests should be added that guarantee it.
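The kind of test meant here could look roughly like the sketch below. It assumes that train_inds is a list of row indices and that the scaler boils down to a per-column mean and standard deviation; how the package actually stores either is not shown in this thread, so rebuild_training_data and the toy data are purely illustrative.

```python
def scale(rows, mean, std):
    """Standardize each column of rows with the saved per-column mean and std."""
    return [[(value - m) / s for value, m, s in zip(row, mean, std)]
            for row in rows]


def rebuild_training_data(spect, train_inds, mean, std):
    """Recreate the scaled training subset from saved indices and scaler parameters."""
    return scale([spect[i] for i in train_inds], mean, std)


def test_rebuild_is_deterministic():
    # Toy stand-in for a spectrogram: rows are time bins, columns are frequency bins
    spect = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
    train_inds = [0, 2]
    mean, std = [3.0, 6.0], [2.0, 4.0]
    first = rebuild_training_data(spect, train_inds, mean, std)
    second = rebuild_training_data(spect, train_inds, mean, std)
    # Rebuilding twice from the same saved pieces must give identical data
    assert first == second
    # And the result must match the values worked out by hand
    assert first == [[-1.0, -1.0], [1.0, 1.0]]
```

A real test would compare the rebuilt arrays against a small, known-good reference stored with the test suite rather than against a full copy saved on every training run.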