ws-choi / Conditioned-Source-Separation-LaSAFT

A PyTorch implementation of the paper: "LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation" (ICASSP 2021)
MIT License

WHY DOES LASAFT ONLY USE A CPU INSTEAD OF A GPU? #10

Closed: lucasdobr15 closed this issue 3 years ago

lucasdobr15 commented 3 years ago

Hello

I wonder if LaSAFT is able to separate music using the GPU? (I'm not referring to model training.)

I have already done everything here: I installed CUDA, cuDNN, and TensorFlow-GPU, and even so LaSAFT insists on using only the CPU to separate the songs :(

I await your feedback, thank you very much :D

(attached: Screenshot_1, Screenshot_2)

ws-choi commented 3 years ago

Hi @lucasdobr15, which function did you call for separation? Did you call model.separate_track as below?

from lasaft.pretrained import PreTrainedLaSAFTNet
model = PreTrainedLaSAFTNet(model_name='lasaft_large_2020')
vocals = model.separate_track(audio, 'vocals') 
drums = model.separate_track(audio, 'drums') 
bass = model.separate_track(audio, 'bass') 
other = model.separate_track(audio, 'other')

Then please check that you have set the pretrained model to CUDA mode:

model = model.cuda()

If that does not work, please share the script you used.
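
(If you want to double-check where the model ended up, a quick sanity check like the sketch below can help. It assumes PreTrainedLaSAFTNet behaves like a standard torch.nn.Module, which the .cuda() call above relies on; none of it is LaSAFT-specific API.)

import torch
from lasaft.pretrained import PreTrainedLaSAFTNet

# PyTorch itself must see the GPU, regardless of any TensorFlow/cuDNN installs
print(torch.cuda.is_available())  # should print True

model = PreTrainedLaSAFTNet(model_name='lasaft_large_2020')
model = model.cuda()

# every parameter should now report a cuda device, e.g. cuda:0
print(next(model.parameters()).device)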

lucasdobr15 commented 3 years ago

This is the source code of the script: 4stems.py

#!/usr/bin/python

import os
import numpy as np
import soundfile as sf
from lasaft.pretrained import PreTrainedLaSAFTNet

model = PreTrainedLaSAFTNet(model_name='lasaft_large_2021')

audio, fs = sf.read('test.wav')
number_samples, number_channels = np.shape(audio)

# audio should be a numpy array of a stereo audio track
# with dtype float32 and shape (T, 2)

vocals2 = model.separate_track(audio, 'vocals')
vocals = vocals2[0:number_samples]

drums2 = model.separate_track(audio, 'drums')
drums = drums2[0:number_samples]

bass2 = model.separate_track(audio, 'bass')
bass = bass2[0:number_samples]

other2 = model.separate_track(audio, 'other')
other = other2[0:number_samples]

sf.write('test_vocals.wav', vocals, fs)
sf.write('test_drums.wav', drums, fs)
sf.write('test_bass.wav', bass, fs)
sf.write('test_other.wav', other, fs)

os.remove("temp.wav")

What do you suggest?

ws-choi commented 3 years ago

Please try the line below:

model = PreTrainedLaSAFTNet(model_name='lasaft_large_2021').cuda()
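
(For reference, here is a minimal sketch of how that one-line change fits into your script; it assumes, as your slicing already does, that separate_track returns an array at least as long as the input.)

import soundfile as sf
from lasaft.pretrained import PreTrainedLaSAFTNet

# the only change: move the model to the GPU right after loading it
model = PreTrainedLaSAFTNet(model_name='lasaft_large_2021').cuda()

audio, fs = sf.read('test.wav')
vocals = model.separate_track(audio, 'vocals')
sf.write('test_vocals.wav', vocals[:len(audio)], fs)
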
lucasdobr15 commented 3 years ago

Thank you so much! I just added this code and it worked <3 (attached: Screenshot_2)

We can only thank you for giving us this beautiful project!

One question:

Why does the end result have "lags" in the songs?

ws-choi commented 3 years ago

You're welcome :) And what kind of lags do you mean? Can you share a sample?

lucasdobr15 commented 3 years ago

ORIGINAL (SAMPLE, 37 SECONDS): https://www.youtube.com/watch?v=9TNyueKk2Nw
LASAFT (SAMPLE, 37 SECONDS): https://www.youtube.com/watch?v=wFB3SR29WTI

Why does that kind of problem happen? :(

ws-choi commented 3 years ago

Thank you for sharing :) Is the "lag" you mentioned something like 0:16~0:18 in https://www.youtube.com/watch?v=wFB3SR29WTI ?

lucasdobr15 commented 3 years ago

Yes, exactly

It's this kind of 'lag' I'm talking about

And another thing: LaSAFT thinks these instruments are voices :/ (guitar solo, flute, wind instruments, and mainly the organ)

Do you have any suggestions to correct these 2 problems?

ws-choi commented 3 years ago

Wow, your analysis is spot on. We have also suspected that our models struggle to distinguish the singing voice from the instruments you mentioned. This might be because the training dataset (MUSDB18) only contains four groups of instruments (vocals, drums, bass, and other), so there are no explicit training cases that force the model to separate those instruments from the singing voice. Possible solutions are quite technical, and we have been designing them.

Below is the list of possible solutions

We have been working on this new project and will publish it after submitting a new paper if it produces a better result :)