williamyang1991 / VToonify

[SIGGRAPH Asia 2022] VToonify: Controllable High-Resolution Portrait Video Style Transfer

Video Toonification #32

Closed ss8319 closed 1 year ago

ss8319 commented 1 year ago

Hi. I am working through the Colab notebook in the repo, specifically PART II - Style Transfer with the specialized VToonify-D model.

I am working through all the steps just fine, but when I reach the Video Toonification code, I can get through the 'Visualize and Rescale Input' part without issue, yet I can't run 'Perform Inference'. The code works well with the default input video, but with my own video it fails.

Running this:

```python
with torch.no_grad():
    batch_frames = []
    print(num)
    for i in tqdm(range(num)):
        if i == 0:
            # uses the first frame read in the earlier 'Visualize and Rescale Input' step
            I = align_face(frame, landmarkpredictor)
            I = transform(I).unsqueeze(dim=0).to(device)
            s_w = pspencoder(I)
            s_w = vtoonify.zplus2wplus(s_w)
            s_w[:, :7] = exstyle[:, :7]
        else:
            success, frame = video_cap.read()
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            if scale <= 0.75:
                frame = cv2.sepFilter2D(frame, -1, kernel_1d, kernel_1d)
            if scale <= 0.375:
                frame = cv2.sepFilter2D(frame, -1, kernel_1d, kernel_1d)
            frame = cv2.resize(frame, (w, h))[top:bottom, left:right]

        batch_frames += [transform(frame).unsqueeze(dim=0).to(device)]

        if len(batch_frames) == batch_size or (i+1) == num:
            x = torch.cat(batch_frames, dim=0)
            batch_frames = []
            # the parsing network works best on 512x512 images, so we predict parsing maps
            # on upsampled frames, followed by downsampling the parsing maps
            x_p = F.interpolate(parsingpredictor(2*(F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)))[0],
                                scale_factor=0.5, recompute_scale_factor=False).detach()
            # we give parsing maps lower weight (1/16)
            inputs = torch.cat((x, x_p/16.), dim=1)
            # d_s has no effect when backbone is toonify
            y_tilde = vtoonify(inputs, s_w.repeat(inputs.size(0), 1, 1), d_s=0.5)
            y_tilde = torch.clamp(y_tilde, -1, 1)
            for k in range(y_tilde.size(0)):
                videoWriter.write(tensor2cv2(y_tilde[k].cpu()))

videoWriter.release()
video_cap.release()
```

gives only the tqdm output: `0it [00:00, ?it/s]`

ss8319 commented 1 year ago

Looking deeper into the code, this part is causing the problem: num has the value -279496122328932608, and I'm not sure where that comes from. Where does the value of num originate? The code shows:

```python
with torch.no_grad():
    batch_frames = []
    for i in tqdm(range(num)):
        if i == 0:
```

Since num is negative, range(num) is empty, so the program never enters the body of the for loop. That is why tqdm reports 0it and no output video is produced.
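
From a quick look at the notebook, num appears to come from OpenCV's frame-count metadata, i.e. something like `num = int(video_cap.get(7))`, where property index 7 is `cv2.CAP_PROP_FRAME_COUNT`. That property can return garbage when the container's metadata is missing or corrupt, which would explain the huge negative number. Below is a minimal sketch (the video path is hypothetical) of how to check the reported count and fall back to counting frames by decoding:

```python
import cv2

video_path = './video.mp4'  # hypothetical path; substitute your uploaded video
video_cap = cv2.VideoCapture(video_path)

# The notebook seems to read the frame count from container metadata;
# property index 7 is cv2.CAP_PROP_FRAME_COUNT, which can be garbage
# for videos with broken or missing metadata.
num = int(video_cap.get(cv2.CAP_PROP_FRAME_COUNT))
print('reported frame count:', num)

# Workaround sketch: if the reported count is implausible, count frames
# by decoding until read() fails, then reopen the capture from the start.
if num <= 0:
    num = 0
    while video_cap.read()[0]:
        num += 1
    video_cap.release()
    video_cap = cv2.VideoCapture(video_path)
print('usable frame count:', num)
```

Re-encoding the input beforehand, e.g. `ffmpeg -i input.mp4 output.mp4`, may also repair the container metadata so that the notebook's original frame-count query works.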