ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/

ecchi.iwara.tv metadata uploader being NA #24237

Open mo-han opened 4 years ago

mo-han commented 4 years ago

Checklist

Verbose log

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-f', '(mp4)[height<=1080][fps<=60]+(m4a/aac)/bestvideo+bestaudio/best', '--proxy=http://127.0.0.1:7777', '--external-downloader', 'aria2c', '--external-downloader-args', '-x5 -s5 -k 1M --file-allocation=trunc', '-o', '%(title)s [%(id)s][%(uploader)s].%(ext)s', '--yes-playlist', 'https://ecchi.iwara.tv/videos/z0y6puxqzqfkm7vre', '--no-check-certificate', '-v']
[debug] Encodings: locale cp936, fs utf-8, out utf-8, pref cp936
[debug] youtube-dl version 2020.03.01
[debug] Python version 3.6.6 (CPython) - Windows-10-10.0.10240-SP0
[debug] exe versions: ffmpeg 4.0.2, ffprobe 4.0.2, phantomjs 2.1.1
[debug] Proxy map: {'http': 'http://127.0.0.1:7777', 'https': 'http://127.0.0.1:7777'}
[Iwara] z0y6puxqzqfkm7vre: Downloading webpage
[Iwara] z0y6puxqzqfkm7vre: Downloading JSON metadata
[debug] Invoking downloader on 'https://ling.iwara.tv/file.php?expire=1583332380&hash=14947e72f69e3f85d6cfdd589ebde21cdc20ee0d&file=2019%2F03%2F12%2F1552384452_z0y6puXQZQFkm7vRE_Source.mp4&op=dl&r=0'
[download] 疑心暗鬼 [z0y6puxqzqfkm7vre][NA].mp4 has already been downloaded
[download] 100% of 105.55MiB

Description

Videos are downloaded successfully, but %(uploader)s is always replaced by NA (it's part of the filename format I want).

mo-han commented 4 years ago

Dammit, I couldn't wait any longer, so I just wrote a tiny Python script to get the uploader and rename the downloaded mp4 files. It turns out the iwara web page can be parsed easily with the lxml library, and the uploader is very easy to extract. I didn't create a pull request though, because such a piece of cake should be doable without any difficulty at all.

Anyway, here is the script I use myself.

#!/usr/bin/env python3
# encoding=utf8
import sys
from urllib.parse import urlparse
from lxml import html
from requests import get
from glob import glob
from os.path import split, splitext, join
from os import rename

# Small helper that fetches an iwara video page and fixes up youtube-dl's output filenames.
class IwaraVideo:
    def __init__(self, url: str):
        self.urlparse = urlparse(url)
        if 'iwara' not in self.urlparse.hostname:
            raise ValueError(url)
        elif 'video' not in self.urlparse.path:
            raise ValueError(url)
        self.url = url
        self.html = None
        self.meta = {
            'id': self.urlparse.path.split('/')[-1],
        }

    # Download and parse the video page once, then cache the element tree.
    def get_page(self):
        if not self.html:
            r = get(self.url)
            self.html = html.document_fromstring(r.text)
        return self.html

    # Pull the uploader name from the page's "submitted by" block.
    def get_uploader(self):
        video_page = self.get_page()
        uploader = video_page.xpath('//div[@class="node-info"]//div[@class="submitted"]//a[@class="username"]')[0].text
        self.meta['uploader'] = uploader
        return uploader

    # Find downloaded .mp4 files whose names contain this video's id tag, e.g. '[z0y6puxqzqfkm7vre]'.
    def find_files_by_id(self, search_in=''):
        id_tag = '[{}]'.format(self.meta['id'])
        self.meta['id_tag'] = id_tag
        mp4_l = glob(search_in + '*.mp4')
        r_l = []
        for i in mp4_l:
            if id_tag in i:
                r_l.append(i)
        return r_l

    # Replace youtube-dl's '[NA]' uploader tag with the real uploader name in each matching filename.
    def rename_files_from_ytdl_na_to_uploader(self, search_in=''):
        na_tag = '[NA]'
        path_l = self.find_files_by_id(search_in=search_in)
        id_tag = self.meta['id_tag']
        uploader = self.get_uploader()
        up_tag = '[{}]'.format(uploader)
        for p in path_l:
            dirname, basename = split(p)
            filename, extension = splitext(basename)
            if na_tag in filename:
                left, right = filename.split(id_tag, maxsplit=1)
                right = right.replace(na_tag, up_tag, 1)
                new_basename = left + id_tag + right + extension
                new_path = join(dirname, new_basename)
                rename(p, new_path)

if __name__ == '__main__':
    u = sys.argv[1]
    video = IwaraVideo(u)
    video.rename_files_from_ytdl_na_to_uploader()
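
The script takes the iwara video page URL as its only argument and is meant to be run from the folder containing the downloaded .mp4 files, e.g. python iwara_rename.py https://ecchi.iwara.tv/videos/z0y6puxqzqfkm7vre (the filename iwara_rename.py is just an example; save it under whatever name you like).
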
mo-han commented 4 years ago

monkey patch version:

# Missing imports and a stand-in get_html_element_tree (same requests + lxml approach as the script above) added so this snippet runs on its own.
from abc import ABCMeta
import requests
import youtube_dl
import youtube_dl.extractor.iwara
from lxml import html as lxml_html

def get_html_element_tree(url):
    return lxml_html.document_fromstring(requests.get(url).text)

class YoutubeDLIwaraX(youtube_dl.extractor.iwara.IwaraIE, metaclass=ABCMeta):
    def _real_extract(self, url):
        # Fetch the page, grab the uploader, then let the stock IwaraIE do the rest.
        html = get_html_element_tree(url)
        uploader = html.xpath('//div[@class="node-info"]//div[@class="submitted"]//a[@class="username"]')[0].text
        data = super(YoutubeDLIwaraX, self)._real_extract(url)
        data['uploader'] = uploader
        # print('#', 'uploader:', uploader)
        return data

def youtube_dl_main_x_iwara(argv=None):
    # Monkey-patch the Iwara extractor, then run youtube-dl's normal main().
    youtube_dl.extractor.IwaraIE = YoutubeDLIwaraX
    youtube_dl.main(argv)
ZYinMD commented 4 years ago

Hi mo-han, in case this issue is never fixed, could you explain to a non-python-programmer how to use your code?

mo-han commented 4 years ago

@ZYinMD Refer to the module I use myself as an example; it's really simple. youtube_dl_main_x_iwara is a modified main() function of the original youtube-dl: just call it and everything works the same as the original, except the iwara extractor now has uploader data.
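
For instance, a minimal wrapper script would look roughly like this (just a sketch; iwara_patch is a placeholder for whatever name you save the monkey-patch snippet under):

#!/usr/bin/env python3
# Minimal wrapper sketch: "iwara_patch" is a placeholder module name for the
# monkey-patch snippet above; save that snippet under whatever name you like.
import sys

from iwara_patch import youtube_dl_main_x_iwara

if __name__ == '__main__':
    # Forward the usual youtube-dl arguments, e.g.:
    #   python this_wrapper.py -o "%(title)s [%(id)s][%(uploader)s].%(ext)s" <iwara video URL>
    youtube_dl_main_x_iwara(sys.argv[1:])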

ZYinMD commented 4 years ago

Thanks! I'll try... By the way, since youtube-dl doesn't support downloading "channels" on iwara, how do you download all videos from one uploader? Did you write your own crawler? I know it's quite easy, but just wondering.

mo-han commented 4 years ago

I don't have a "channel" extractor (neither does ytdl), and it's not "quite easy" for me -- it would need to check the "private" flag of the videos, handle the "next page" actions on the "all videos" result page, and so on. I haven't tried to do that, and there would definitely be a lot of problems and work involved.

As for what you want -- batch downloading from iwara.tv or similar sites -- I do have a solution. It's not fully automated, but it still saves a lot of copy-paste and mouse-click operations.

First we need to get the URLs of the selected videos. I don't use Chrome, but Firefox has a feature called "View Selection Source". When anything on a page is selected (or you select everything with ctrl+a), that feature shows up in the right-click context menu; it opens a new tab with the source code of the page, and the parts corresponding to your selection are already highlighted. So we can just use the mouse to select multiple videos (their thumbnails on the web page), or select everything on the page, choose View Selection Source from the right-click menu, copy (ctrl+c) the selected source code to the clipboard, and move on.

Secondly, we need to find all the video URLs in the clipboard. A lot of tools and methods could do this job, but I wrote my own tool, called mykit.py. It's a CLI program with a lot of sub-commands, among which is one called clipboard.findurl, or cb.url, or cburl -- the same command, just several aliases. This cburl sub-command extracts text strings from a file or the clipboard by a given pattern. The pattern is a regex, but we don't need to write our own because iwara's video URL pattern is already one of the presets. So a simple command like mykit cburl iwara will find all of the video URLs in the source code on the clipboard and print them out line by line; the results are also copied back to the clipboard.
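
If you don't have mykit.py, a rough stand-in for that step looks like this (a sketch, not mykit's actual code; it assumes the pyperclip package for clipboard access, and the URL regex is just guessed from links like https://ecchi.iwara.tv/videos/z0y6puxqzqfkm7vre):

#!/usr/bin/env python3
# Rough stand-in for the "cburl iwara" step: pull iwara video URLs out of the
# clipboard, print them, and copy them back one per line. Not mykit.py's code;
# the URL regex is guessed from links like https://ecchi.iwara.tv/videos/<id>.
import re
import pyperclip

IWARA_VIDEO_URL = re.compile(r'https?://(?:\w+\.)?iwara\.tv/videos/[0-9A-Za-z]+')

def iwara_urls_from_clipboard():
    source = pyperclip.paste()
    urls = sorted(set(IWARA_VIDEO_URL.findall(source)))
    pyperclip.copy('\n'.join(urls))  # copy the result back, like cburl does
    return urls

if __name__ == '__main__':
    print('\n'.join(iwara_urls_from_clipboard()))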

Finally, just use those URLs as arguments to launch the download processes. We could save the URL lines into a file and use a shell script to read them out and run youtube-dl (or the modified version) on each line. Again, mykit.py can give a hand here, with a sub-command called run.from.lines, or runlines, or rl, which reads lines from a file or from the clipboard and runs a command template with each line. What I do is type a single command, mykit.py rl ytdl {}, and it reads the URL lines from the clipboard and runs ytdl {url} for each.
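
Without mykit.py, that last step is roughly this (again a sketch, not the actual runlines code): save the URLs to a file, one per line, and run something like:

#!/usr/bin/env python3
# Rough stand-in for the "rl ytdl {}" step: read URLs (one per line) from a file
# and run youtube-dl on each. Swap in your own ytdl wrapper command if you have one.
import subprocess
import sys

def run_per_line(path, command=('youtube-dl',)):
    with open(path, encoding='utf-8') as f:
        for line in f:
            url = line.strip()
            if url:
                subprocess.run(list(command) + [url], check=False)

if __name__ == '__main__':
    run_per_line(sys.argv[1])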

Not very automatic, but convenient enough, isn't it?

Or you could write your own "channel" extractor, if it's worth it.

ZYinMD commented 4 years ago

Thanks so much!! I read all the code in those script files you mentioned, and they make perfect sense. As a Python noob and PowerShell noob I still have some questions about installation; I think I'll open issues in your repo. Thanks and see you there!

mo-han commented 4 years ago

@ZYinMD That's fine, I'm half a Python noob and a total PowerShell noob myself.