Open sonovice opened 4 months ago
Thank you for your interest in our work. I share your thoughts on this matter.
In our experiments, we found that our method performs exceptionally well in restoring harmonic structures, which are very prominent in music. Therefore, I also believe that our method should be well-suited for music enhancement.
For a 48 kHz sampling rate, increasing the FFT window size and hop size is feasible, but our phase prediction method is quite sensitive to the hop size. My main concern is that a larger hop size might lead to a decline in performance. The most suitable hop size would need to be determined based on specific experimental results.
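To make the window/hop trade-off concrete, here is a minimal sketch of the arithmetic involved when moving to 48 kHz. It only converts window and hop durations into sample counts. The function name `stft_params_for_rate` and the 25 ms / 6.25 ms durations are hypothetical placeholders, not values taken from the paper or the repository. Keeping the durations fixed triples the hop in samples, which may or may not be the kind of "larger hop size" the phase prediction is sensitive to; that would still have to be checked experimentally.

```python
def stft_params_for_rate(sample_rate_hz, win_ms, hop_ms):
    """Convert window and hop durations (in ms) to sample counts at a given rate."""
    win_length = round(sample_rate_hz * win_ms / 1000)
    hop_length = round(sample_rate_hz * hop_ms / 1000)
    # Smallest power of two that is >= win_length, a common choice for the FFT size.
    n_fft = 1 << (win_length - 1).bit_length()
    return {"n_fft": n_fft, "win_length": win_length, "hop_length": hop_length}


# Hypothetical durations, chosen only for illustration (assumption, not from the paper).
params_16k = stft_params_for_rate(16_000, win_ms=25.0, hop_ms=6.25)
params_48k = stft_params_for_rate(48_000, win_ms=25.0, hop_ms=6.25)

print(params_16k)
print(params_48k)
```

A sweep over a few `hop_ms` values at 48 kHz, with everything else held fixed, would be one way to find the hop size at which performance starts to degrade.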
Since I am no longer working on speech enhancement, I would appreciate it if you could share any progress you make.
Thank you for open sourcing your research, highly appreciated!
After reading the paper thoroughly, I have the feeling that the approach could be used almost as-is not only for speech enhancement but also for music enhancement, e.g. of smartphone recordings. The FFT window and hop size would need to be increased to make sense at 48 kHz, and perhaps the length of the audio snippets as well, to capture more context. Also, PESQ and similar metrics are geared towards speech, so I would rather use something like FAD (Fréchet Audio Distance).
Do you see any other obstacles or maybe some valid reasons why this might be a bad idea?