yardencsGitHub / tweetynet

Hybrid convolutional-recurrent neural networks for segmentation of birdsong and classification of elements
BSD 3-Clause "New" or "Revised" License

reproducing issue with segment error rate #81

Closed NickleDave closed 3 years ago

NickleDave commented 3 years ago

@yardencsGitHub documenting this here, in case you can help troubleshoot

I was able to reproduce the difference in segment error rate we see.

To do so, I used the conda environment that is (I think) the closest I have to the one I used when running the experiments we submitted on 2020-05-02. I made a file to exactly reproduce that environment; a copy is attached to this issue.

Then I did editable installs of copies of both the tweetynet and vak repos into that environment. Finally I checked out a commit in both those repos that was close in time to 2020-05-02, and ran prep followed by learncurve for Bird 0 from the BirdsongRecognition dataset.

steps to reproduce:

$ conda create --name vak-2020-05-02-env --file vak-2020-05-02-best-guess-spec-file.txt  # file attached to this issue
# ... env creation happens ...
$ conda activate vak-2020-05-02-env
(vak-2020-05-02-env) $ git clone git@github.com:yardencsGitHub/tweetynet.git tweetynet-2020-05-02
(vak-2020-05-02-env) $ git clone git@github.com:NickleDave/vak.git vak-2020-05-02
(vak-2020-05-02-env) $ cd vak-2020-05-02
# doesn't matter whether we install or checkout first -- the beauty (and terror) of editable installs
(vak-2020-05-02-env) $ pip install -e .
(vak-2020-05-02-env) $ git checkout 40645d86d66d07ae8275b8a8bbbd28bfe9713b48
(vak-2020-05-02-env) $ cd ../tweetynet-2020-05-02
(vak-2020-05-02-env) $ pip install -e .
(vak-2020-05-02-env) $ git checkout 8616c86c4bb8dfd399de8ba0266edff075e02562
(vak-2020-05-02-env) $ vak prep src/configs/config_BirdsongRecognition_bird00.toml
(vak-2020-05-02-env) $ vak learncurve src/configs/config_BirdsongRecognition_bird00.toml

my current candidates for what's causing the difference

vak-2020-05-02-best-guess-spec-file.txt

NickleDave commented 3 years ago

update: well, it definitely starts with the spectrograms.

The range of values in spectrograms from the "old" version is ~1e5 to 1e6; the range from the "new" version (default parameters) is ~1e-5 to 1e-4, as computed using np.ptp. Digging into why now.
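
The listing below came from a comparison loop along these lines (a sketch, not the exact script; the directory paths are hypothetical, and 's' is the default spect_key shown later in this thread):

import pathlib
import numpy as np

# hypothetical locations of the two prep outputs being compared
old_dir = pathlib.Path('prep-2020-05-02')
new_dir = pathlib.Path('prep-now')

for ind, old_path in enumerate(sorted(old_dir.glob('*.spect.npz'))):
    old_spect = np.load(old_path)['s']
    new_spect = np.load(new_dir / old_path.name)['s']
    print(f'file {ind}, filename: {old_path.name}')
    print(f'spect shape: {old_spect.shape}')
    # np.ptp = "peak to peak", i.e. max minus min
    print(f'spect 0502 range: {np.ptp(old_spect)}, spect now range: {np.ptp(new_spect)}')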

file 0, filename: 0.wav.spect.npz
spect shape: (257, 3822)
spect 0502 range: 118237.32749945407, spect now range: 0.00011011709226244508
file 1, filename: 1.wav.spect.npz
spect shape: (257, 4429)
spect 0502 range: 558753.4841373258, spect now range: 0.0005203797334221432
file 2, filename: 10.wav.spect.npz
spect shape: (257, 2261)
spect 0502 range: 10700.777428964404, spect now range: 9.96587558553033e-06
file 3, filename: 100.wav.spect.npz
spect shape: (257, 2048)
spect 0502 range: 49320.97004524937, spect now range: 4.593373280507454e-05
file 4, filename: 101.wav.spect.npz
spect shape: (257, 6358)
spect 0502 range: 1383690.2956972122, spect now range: 0.0012886620086591805
file 5, filename: 102.wav.spect.npz
spect shape: (257, 7009)
spect 0502 range: 207145.06435132195, spect now range: 0.00019291887465056214
file 6, filename: 103.wav.spect.npz
spect shape: (257, 2274)
spect 0502 range: 96792.19845030694, spect now range: 9.014475946343219e-05
file 7, filename: 104.wav.spect.npz
spect shape: (257, 8066)
spect 0502 range: 280292.1757529632, spect now range: 0.00026104243076682386
file 8, filename: 105.wav.spect.npz
spect shape: (257, 7127)
spect 0502 range: 315669.32687975815, spect now range: 0.00029398997023679144
file 9, filename: 106.wav.spect.npz
spect shape: (257, 6924)
spect 0502 range: 181780.96820718853, spect now range: 0.00016929671932681327

NickleDave commented 3 years ago

I think this may go against our intuition that what matters is having smoothly-varying values in the spectrogram.

When I visualize them with imshow they look exactly the same ... because imshow normalizes so that the min is 0 and the max is 1.

Obviously the old version must be doing some log transform by default? The question is what's different about the distribution of values. It seems like normalizing has a very different effect on the distribution of values going into the network when we start from the log-transformed version.
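
A toy illustration of why imshow hides the difference (nothing vak-specific; s is just a stand-in for a spectrogram):

import matplotlib.pyplot as plt
import numpy as np

s = np.random.default_rng(0).random((257, 100))  # stand-in spectrogram
fig, (ax1, ax2) = plt.subplots(1, 2)
# by default imshow maps each array's own min/max onto the colormap,
# so arrays that differ only by a scale factor render identically
ax1.imshow(s)
ax2.imshow(s * 1e9)
plt.show()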

yardencsGitHub commented 3 years ago

Do you recall if the default was log(x) or log(1+x)?

NickleDave commented 3 years ago

I am looking now to see if anything changed about the default.

I added log(1+x) when you asked me to, but I do not think I ever used it.
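
For reference, the difference between the two candidate transforms (a toy example, not vak code):

import numpy as np

spect = np.array([1e-12, 1.0, 1e5])   # toy power values
print(np.log(spect))    # log(x): approx [-27.63, 0.0, 11.51]; diverges as x -> 0
print(np.log1p(spect))  # log(1 + x): approx [1e-12, 0.69, 11.51]; finite at 0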

NickleDave commented 3 years ago

confirming that the config file parsing bug in the 2020-05-02 version behaves as I explained before: SPECT_PARAMS did not get parsed, since cli.prep passed in the sections argument with 'SPECTROGRAM' as the name of the section to parse (so SPECT_PARAMS got skipped, even though it was recognized as a valid section name).
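
A minimal sketch of the bug pattern described above (illustrative names only, not vak's actual parsing code):

def parse_sections(config, sections):
    """parse only the sections the caller asked for"""
    parsed = {}
    for name in config:
        # SPECT_PARAMS is a valid section name, but because the caller
        # only passed sections=['SPECTROGRAM'], it gets skipped here
        # and the default values are used instead
        if name in sections:
            parsed[name] = config[name]
    return parsed

parse_sections({'SPECTROGRAM': {}, 'SPECT_PARAMS': {'fft_size': 1024}},
               sections=['SPECTROGRAM'])  # SPECT_PARAMS silently dropped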

this results in the following default values for the spectrogram config:

spect_params=SpectParamsConfig(fft_size=512, step_size=64, freq_cutoffs=None, thresh=None, transform_type=None, spect_key='s', freqbins_key='f', timebins_key='t', audio_path_key='audio_path')

can also confirm these are still the parameters being used by the time we get inside the io.audio.to_spect function that actually calls spect.spectrogram

SpectParamsConfig(fft_size=512, step_size=64, freq_cutoffs=None, thresh=None, transform_type=None, spect_key='s', freqbins_key='f', timebins_key='t', audio_path_key='audio_path')

so I think this is likely to be due to some change in matplotlib.mlab.specgram and/or scipy.signal 😬, not something we did

NickleDave commented 3 years ago

yup. Before we even get to anything about thresholds in the spectrogram function, we already have this range of values. The range returned by matplotlib.mlab.specgram has changed, I think.

-> spect, freqbins, timebins = specgram(dat, fft_size, samp_freq, noverlap=noverlap)[:3]
(Pdb) noverlap
448
(Pdb) n
> /home/bart/Documents/repos/coding/birdsong/vak-2020-05-02/src/vak/util/spect.py(71)spectrogram()
-> if transform_type:
(Pdb) spect.min()
1.2016753755074585e-12
(Pdb) spect.max()
118237.32749945407

NickleDave commented 3 years ago

My fault for depending on this function, which is probably not a priority for the matplotlib maintainers.

yardencsGitHub commented 3 years ago

Ah. The curse of Python packages finally reared its ugly head. @NickleDave, do you mind if I make an effort to calculate the spectrograms 'my way' and re-run these experiments?

NickleDave commented 3 years ago

No, of course not. Can we chat briefly so I can understand what you think is making the difference here?

NickleDave commented 3 years ago

I think I have found the source of the issue.

It's not actually any change within libraries like matplotlib.

It's because I changed the functions we use to load .wav files, and by default they return numpy arrays with different dtypes: before, we were using scipy.io.wavfile.read, which returns int16, but now I am using pysoundfile's read, which defaults to float64.

Obviously a time series with a bunch of small floating-point values is going to give you a very different range in your spectrogram compared to a time series with a bunch of ints.

(Pdb) dat_int, fs = constants.AUDIO_FORMAT_FUNC_MAP[audio_format]('/home/art/Documents/data/BirdsongRecognition/Bird0/Wave/0.wav', dtype='int16')
(Pdb) dat_float, fs = constants.AUDIO_FORMAT_FUNC_MAP[audio_format]('/home/art/Documents/data/BirdsongRecognition/Bird0/Wave/0.wav')
(Pdb) dat_int
array([ 1,  4,  5, ..., 20, 11, 12], dtype=int16)
(Pdb) dat_float
array([3.05175781e-05, 1.22070312e-04, 1.52587891e-04, ...,
       6.10351562e-04, 3.35693359e-04, 3.66210938e-04])
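
For reference, the same difference can be reproduced with soundfile directly (the path here is hypothetical):

import soundfile as sf

# soundfile.read returns float64 scaled to [-1.0, 1.0] by default;
# pass dtype='int16' to get the raw integer samples instead
dat_float, fs = sf.read('Bird0/Wave/0.wav')
dat_int, fs = sf.read('Bird0/Wave/0.wav', dtype='int16')
# the float values are the int16 values scaled by 1 / 2**15:
# 1 / 32768 = 3.0517578125e-05, matching the arrays above
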
NickleDave commented 3 years ago

confirmed (for one file at least) that the output of spectrogram is the same for the current version of vak if I load the .wav file as int16 instead of float64

current version

(Pdb) dat_int, fs = constants.AUDIO_FORMAT_FUNC_MAP[audio_format]('/home/art/Documents/data/BirdsongRecognition/Bird0/Wave/0.wav', dtype='int16')
(Pdb) s_int, f, t = spectrogram(dat_int, fs, sp.fft_size, sp.step_size, sp.thresh, sp.transform_type, sp.freq_cutoffs)
(Pdb) s_int
array([[4.24147136e+00, 1.19410352e+00, 3.59396114e-02, ...,
        5.25278757e-01, 1.79924712e+00, 1.54650439e+00],
       [9.40141943e+00, 7.50662916e+00, 4.94522663e+00, ...,
        2.64810360e+00, 2.28606024e+00, 2.71039476e+00],
       [1.21478727e+00, 5.42120235e+00, 6.00196272e+00, ...,
        7.24186133e-01, 4.72530240e-01, 1.58501760e-02],
       ...,
       [4.43818634e-05, 4.12808592e-05, 6.35808880e-06, ...,
        1.60072882e-04, 3.17080914e-04, 4.36996110e-04],
       [5.90490805e-05, 7.57839587e-05, 1.14802833e-04, ...,
        6.69261682e-05, 7.79648788e-05, 1.51168512e-04],
       [1.63183132e-05, 9.31373812e-07, 1.83922333e-05, ...,
        8.74760214e-10, 3.23807428e-09, 6.82295401e-06]])

old version

Note the !s is to prevent pdb from interpreting s as a command instead of a variable name.

(Pdb) audio_files[0]
'/home/bart/Documents/data/BirdsongRecognition/Bird0/Wave/0.wav'
(Pdb) fs, dat = AUDIO_FORMAT_FUNC_MAP[audio_format](audio_files[0])
(Pdb) !s, f, t = spectrogram(dat, fs, sp.fft_size, sp.step_size, sp.thresh, sp.transform_type, sp.freq_cutoffs)
(Pdb) !s
array([[4.24147136e+00, 1.19410352e+00, 3.59396114e-02, ...,
        5.25278757e-01, 1.79924712e+00, 1.54650439e+00],
       [9.40141943e+00, 7.50662916e+00, 4.94522663e+00, ...,
        2.64810360e+00, 2.28606024e+00, 2.71039476e+00],
       [1.21478727e+00, 5.42120235e+00, 6.00196272e+00, ...,
        7.24186133e-01, 4.72530240e-01, 1.58501760e-02],
       ...,
       [4.43818634e-05, 4.12808592e-05, 6.35808880e-06, ...,
        1.60072882e-04, 3.17080914e-04, 4.36996110e-04],
       [5.90490805e-05, 7.57839587e-05, 1.14802833e-04, ...,
        6.69261682e-05, 7.79648788e-05, 1.51168512e-04],
       [1.63183132e-05, 9.31373812e-07, 1.83922333e-05, ...,
        8.74760214e-10, 3.23807428e-09, 6.82295401e-06]])

NickleDave commented 3 years ago

The question is whether the values we get from the new version (when loading with float) can be transformed into the old version by just a scalar multiply.

If so, that makes me think what really matters from the POV of the network is the absolute range of the values after normalization.

This would make sense if dividing by the std of a bunch of tiny floating-point values to normalize gives us a gigantic range. Bet that's the explanation.

NickleDave commented 3 years ago

Looks like yes, old and new differ by just a scalar multiply (and the scalar is exactly 2**30 = (2**15)**2, consistent with the int16 samples being scaled by 1/2**15 when loaded as float64, and the spectrogram scaling with the square of the amplitude):

(Pdb) s_int
array([[4.24147136e+00, 1.19410352e+00, 3.59396114e-02, ...,
        5.25278757e-01, 1.79924712e+00, 1.54650439e+00],
       [9.40141943e+00, 7.50662916e+00, 4.94522663e+00, ...,
        2.64810360e+00, 2.28606024e+00, 2.71039476e+00],
       [1.21478727e+00, 5.42120235e+00, 6.00196272e+00, ...,
        7.24186133e-01, 4.72530240e-01, 1.58501760e-02],
       ...,
       [4.43818634e-05, 4.12808592e-05, 6.35808880e-06, ...,
        1.60072882e-04, 3.17080914e-04, 4.36996110e-04],
       [5.90490805e-05, 7.57839587e-05, 1.14802833e-04, ...,
        6.69261682e-05, 7.79648788e-05, 1.51168512e-04],
       [1.63183132e-05, 9.31373812e-07, 1.83922333e-05, ...,
        8.74760214e-10, 3.23807428e-09, 6.82295401e-06]])
(Pdb) s_float, f, t = spectrogram(dat_float, fs, sp.fft_size, sp.step_size, sp.thresh, sp.transform_type, sp.freq_cutoffs)
(Pdb) s_int / s_float
array([[1.07374182e+09, 1.07374182e+09, 1.07374182e+09, ...,
        1.07374182e+09, 1.07374182e+09, 1.07374182e+09],
       [1.07374182e+09, 1.07374182e+09, 1.07374182e+09, ...,
        1.07374182e+09, 1.07374182e+09, 1.07374182e+09],
       [1.07374182e+09, 1.07374182e+09, 1.07374182e+09, ...,
        1.07374182e+09, 1.07374182e+09, 1.07374182e+09],
       ...,
       [1.07374182e+09, 1.07374182e+09, 1.07374182e+09, ...,
        1.07374182e+09, 1.07374182e+09, 1.07374182e+09],
       [1.07374182e+09, 1.07374182e+09, 1.07374182e+09, ...,
        1.07374182e+09, 1.07374182e+09, 1.07374182e+09],
       [1.07374182e+09, 1.07374182e+09, 1.07374182e+09, ...,
        1.07374182e+09, 1.07374182e+09, 1.07374182e+09]])
(Pdb) s_div = s_int / s_float
(Pdb) s_scalar = s_div[0, 0].item()
(Pdb) s_scalar
1073741824.0
(Pdb) s_scaled = s_float * s_scalar
(Pdb) s_scaled
array([[4.24147136e+00, 1.19410352e+00, 3.59396114e-02, ...,
        5.25278757e-01, 1.79924712e+00, 1.54650439e+00],
       [9.40141943e+00, 7.50662916e+00, 4.94522663e+00, ...,
        2.64810360e+00, 2.28606024e+00, 2.71039476e+00],
       [1.21478727e+00, 5.42120235e+00, 6.00196272e+00, ...,
        7.24186133e-01, 4.72530240e-01, 1.58501760e-02],
       ...,
       [4.43818634e-05, 4.12808592e-05, 6.35808880e-06, ...,
        1.60072882e-04, 3.17080914e-04, 4.36996110e-04],
       [5.90490805e-05, 7.57839587e-05, 1.14802833e-04, ...,
        6.69261682e-05, 7.79648788e-05, 1.51168512e-04],
       [1.63183132e-05, 9.31373812e-07, 1.83922333e-05, ...,
        8.74760214e-10, 3.23807428e-09, 6.82295401e-06]])
(Pdb) s_int
array([[4.24147136e+00, 1.19410352e+00, 3.59396114e-02, ...,
        5.25278757e-01, 1.79924712e+00, 1.54650439e+00],
       [9.40141943e+00, 7.50662916e+00, 4.94522663e+00, ...,
        2.64810360e+00, 2.28606024e+00, 2.71039476e+00],
       [1.21478727e+00, 5.42120235e+00, 6.00196272e+00, ...,
        7.24186133e-01, 4.72530240e-01, 1.58501760e-02],
       ...,
       [4.43818634e-05, 4.12808592e-05, 6.35808880e-06, ...,
        1.60072882e-04, 3.17080914e-04, 4.36996110e-04],
       [5.90490805e-05, 7.57839587e-05, 1.14802833e-04, ...,
        6.69261682e-05, 7.79648788e-05, 1.51168512e-04],
       [1.63183132e-05, 9.31373812e-07, 1.83922333e-05, ...,
        8.74760214e-10, 3.23807428e-09, 6.82295401e-06]])
(Pdb) np.allclose(s_int, s_scaled)
True

NickleDave commented 3 years ago

Not obvious to me what the difference is, just looking at descriptive stats of batches.

I mean, speaking mathematically, the "normalization" literally standardizes them, so we expect samples to have roughly zero mean and unit variance regardless of the original input range. It shouldn't matter that we start from a different range, especially since the spectrograms are just rescalings of each other.
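
A toy demonstration that standardization is scale-invariant (nothing vak-specific):

import numpy as np

s = np.random.default_rng(0).random((257, 1000))

def standardize(a):
    return (a - a.mean()) / a.std()

# multiplying by a scalar scales the mean and std by the same factor,
# so the standardized arrays are identical (up to float rounding)
print(np.allclose(standardize(s), standardize(s * 2**30)))  # True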

Below, x is a batch from the dataloader in the Model._train function. This means it will have already run through the transform that does the normalization in both cases.

They are randomly drawn batches so we don't expect the per-sample values to be exactly the same.

range of values, using peak-to-peak

old version:

>>> [np.ptp(x_sample.cpu().numpy()) for x_sample in x]
[3.5529435, 6.7687755, 7.797055, 5.5229335, 4.8032947, 5.087441, 49.022923, 8.137567]

current version:

>>> [np.ptp(x_sample.cpu().numpy()) for x_sample in x]
[6.317197, 4.4625735, 83.273865, 5.260035, 6.6298385, 12.167436, 7.636093, 8.63617]

Both look about the same: ranges between 1e0 and 1e1, except once in a while one that's ~1e2.

absolute min and max values

Also don't see a huge difference here

old version:

>>> for x_sample in x:
...     x_sample_np = x_sample.cpu().numpy()
...     print(f'x min: {x_sample_np.min()}, x max: {x_sample_np.max()}')
...
x min: -0.9133289456367493, x max: 2.6396145820617676
x min: -0.9389304518699646, x max: 5.829844951629639
x min: -0.9394678473472595, x max: 6.857586860656738
x min: -0.928808331489563, x max: 4.594125270843506
x min: -0.9408183097839355, x max: 3.862476348876953
x min: -0.9346219301223755, x max: 4.1528191566467285
x min: -0.9237425327301025, x max: 48.099178314208984
x min: -0.9410564303398132, x max: 7.196509838104248

current version:

>>> for x_sample in x:
...     x_sample_np = x_sample.cpu().numpy()
...     print(f'x min: {x_sample_np.min()}, x max: {x_sample_np.max()}')
...
x min: -1.0686646699905396, x max: 5.248532295227051
x min: -1.0715813636779785, x max: 3.3909921646118164
x min: -1.0613844394683838, x max: 82.21247863769531
x min: -1.0460361242294312, x max: 4.213998794555664
x min: -1.0682597160339355, x max: 5.561578750610352
x min: -1.0674132108688354, x max: 11.100022315979004
x min: -1.0691827535629272, x max: 6.566910266876221
x min: -1.0617287158966064, x max: 7.574441432952881

mean, std, median

old version:

>>> for x_sample in x:
...     x_sample_np = x_sample.cpu().numpy()
...     print(f'x mean: {x_sample_np.mean()}, x std: {x_sample_np.std()}, x median: {np.median(x_sample_np)}')
...
x mean: -0.10917782038450241, x std: 0.10606592893600464, x median: -0.09260101616382599
x mean: -0.09346791356801987, x std: 0.17985652387142181, x median: -0.08875711262226105
x mean: -0.08290764689445496, x std: 0.24375955760478973, x median: -0.08979134261608124
x mean: -0.10684436559677124, x std: 0.14660131931304932, x median: -0.09389768540859222
x mean: -0.10449446737766266, x std: 0.1550692766904831, x median: -0.09473021328449249
x mean: -0.09199614822864532, x std: 0.18542729318141937, x median: -0.08992025256156921
x mean: 0.20172898471355438, x std: 1.7558399438858032, x median: -0.07527005672454834
x mean: -0.0985182523727417, x std: 0.18250897526741028, x median: -0.09058909118175507

current version:

>>> for x_sample in x:
...     x_sample_np = x_sample.cpu().numpy()
...     print(f'x mean: {x_sample_np.mean()}, x std: {x_sample_np.std()}, x median: {np.median(x_sample_np)}')
...
x mean: -0.07098879665136337, x std: 0.22726330161094666, x median: -0.08440127968788147
x mean: -0.09667869657278061, x std: 0.16204503178596497, x median: -0.09410282224416733
x mean: 0.15948288142681122, x std: 1.9239710569381714, x median: -0.07546369731426239
x mean: -0.10425417125225067, x std: 0.16331957280635834, x median: -0.09719042479991913
x mean: -0.08318854868412018, x std: 0.23316417634487152, x median: -0.09144178032875061
x mean: -0.02177470549941063, x std: 0.40939199924468994, x median: -0.07989080995321274
x mean: -0.08480856567621231, x std: 0.24174930155277252, x median: -0.09146381914615631
x mean: -0.06952553987503052, x std: 0.3161887526512146, x median: -0.08758103847503662

NickleDave commented 3 years ago

@yardencsGitHub I tried just hacking the current version so that it loads .wav files as int16 and then re-running.

In this case I still get the higher segment error rate. This makes me afraid that it's actually something about torch that's causing the difference.

After making the hack, I verified by inspection that spectrograms were made with the larger range, as in the old version.

I also checked that the torch.Tensor.dtype is the same for both, float32. It is.
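
The hack in question might look something like this (a sketch; vak's actual audio-loading code is not shown in this issue):

import functools
import soundfile

# debugging hack, not a real fix: force every soundfile.read call to
# return int16, so spectrograms are computed from integer samples again
soundfile.read = functools.partial(soundfile.read, dtype='int16')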

yardencsGitHub commented 3 years ago

I am currently re-running one canary experiment and another for Bird0 from the 'recognition' dataset. If the canary experiment also gets higher segmentation errors, it will confirm that something in the newer versions (torch or other libraries) is different.

NickleDave commented 3 years ago

I see that Adam parameters are exactly the same between versions.

The next suspect, then, is the functions computing the segment error rate.
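
For concreteness, a minimal sketch of a segment error rate, assuming it is defined as the edit (Levenshtein) distance between predicted and reference label sequences, normalized by the reference length (vak's exact implementation may differ):

def levenshtein(a, b):
    """edit distance between two label sequences"""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def segment_error_rate(y_pred, y_true):
    return levenshtein(y_pred, y_true) / len(y_true)

print(segment_error_rate('abcd', 'abce'))  # 0.25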

NickleDave commented 3 years ago

ok @yardencsGitHub I think I have tracked down the source of the discrepancy

the bad news is: it was due to a bug we had before

I failed to realize at the time that making this fix would have an impact on how we computed the segment error rate -- I simply made this change to has_unlabeled because it was throwing an error when trying to operate on a row vector (as Samuel Brudner noticed in https://github.com/NickleDave/vak/issues/239).

As you can see from my commit message, I was fixated on other errors resulting from the same problem with lbl_tb2labels (the function was throwing a TypeError, apparently?), so I made a similar fix there. But I didn't connect the dots between that and the segment error rate.

in plain language, what was happening:

We can discuss further face-to-face, but I want to document it here.

NickleDave commented 3 years ago

Closing this; nothing to address. We will report correct error rates as now computed by vak.

Even after the fix described above, there was a minor bug that may in some rare cases have affected some calculations, as described in https://github.com/NickleDave/vak/issues/355, but that was fixed by https://github.com/NickleDave/vak/commit/c76ee5d91ba5a697d98e659226314ac3c94ea31f and is now covered by unit tests in vak.