First approach depends on #43
First approach has been implemented in 1da677b and 0d03bac
I'm going to leave this issue open for now. Hoping to get more feedback.
I am currently working on a (pair of) script(s) that will perform the migration in a somewhat automated manner. This will cover all videos, whether they're watched, and anything else that's in the version 1 database. The first script is run with version 1 installed, and the second with v2, because Python can't import two modules with the same name at the same time.
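Roughly, the hand-off between the two scripts is just an intermediate file; something along these lines (a sketch only, where the v1 API calls, attribute names, and field names are placeholders, not the real interface or my actual scripts):

```python
# export_v1.py -- run in an environment with ytcc v1 installed.
# NOTE: sketch only; the v1 API used here (Ytcc, list_videos, attribute names)
# is a placeholder, not the real interface.
import json
from ytcc.core import Ytcc  # hypothetical v1 import

core = Ytcc()
videos = [
    {"yt_video_id": v.yt_videoid, "title": v.title, "watched": v.watched}
    for v in core.list_videos()  # hypothetical accessor
]

with open("v1_export.json", "w") as export_file:
    json.dump(videos, export_file)

# import_v2.py (run separately, with v2 installed and from a different
# directory) then reads v1_export.json and re-creates the subscriptions
# and watched flags through the v2 code.
```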
I'm currently trying to work out how to calculate the `extractor_hash` from a video URL, but I'm not getting very far. Could you offer some help on this? This is the only thing (I think) holding back the entire script.
@EmRowlands Awesome!!
The `extractor_hash` is calculated as a SHA-256 hash of youtube-dl's (unprocessed) information extractor output.
Basically, pseudocode for one item:
`sha256(YoutubeDL(...).extract_info(..., process=False).entries[0])`
Relevant lines: https://github.com/woefe/ytcc/blob/d97eebdab0440b04028dadb5ed3f09eaf3eb2edd/ytcc/core.py#L50 https://github.com/woefe/ytcc/blob/d97eebdab0440b04028dadb5ed3f09eaf3eb2edd/ytcc/core.py#L84
Importantly, the hashed entry comes from a playlist, not from the video page itself. I'm not sure we can easily recompute it from a yt_video_id, which is probably what we would need when converting from v1?
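For illustration, computing the hash for the first item of a playlist looks roughly like this (a sketch only; the exact key/value handling lives in the linked core.py lines, and the playlist URL is a placeholder):

```python
import hashlib
from youtube_dl import YoutubeDL

playlist_url = "https://www.youtube.com/playlist?list=..."  # placeholder

ydl = YoutubeDL({"quiet": True})
# Unprocessed extractor output: entries are the raw playlist items.
info = ydl.extract_info(playlist_url, download=False, process=False)
entry = next(iter(info["entries"]))  # first unprocessed playlist entry

# Sketch of hashing the entry's key/value pairs; the real implementation is
# in the core.py lines linked above.
digest = hashlib.sha256()
for key in sorted(entry):
    digest.update(key.encode("utf-8"))
    digest.update(str(entry[key]).encode("utf-8"))
print(digest.hexdigest())
```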
I knew how it was being created, I just couldn't reproduce it because I didn't have access to the playlist data. I'm also not sure this way of generating the hash makes sense, since a video that is in multiple playlists will have multiple `extractor_hash`es (unless this is intentional). I considered suggesting the same method but with the extractor info for the specific video instead; however, that would require fetching the info with youtube-dl for every single video being imported.
Perhaps it would be better to use something like the format provided by `--download-archive`, which produces strings that look like this:
`youtube dQw4w9WgXcQ`
Here the first part is the name of the extractor, and the second is a site-specific string that uniquely identifies a video.
Admittedly, the `extractor_hash` approach has problems. I actually found cases where it won't work: the current function relies on a `Dict[str, str]`, which is not always what the extractors output. Sometimes the values are more complex structures.
The hash should be the same for videos in different playlists; at least it was for all the examples I checked.
I have looked into the `--download-archive` option again. It uses `_make_archive_id` to create the id, which can be generated from an unprocessed result and therefore does not require more network requests than the current approach. I think using `_make_archive_id` is more reliable, because then we rely on existing youtube-dl internals, which should play more nicely with the rest of youtube-dl.
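Roughly, it can be called on an unprocessed playlist entry like this (a sketch; `_make_archive_id` is a private youtube-dl helper, so the exact call is an assumption based on the youtube-dl source, and the URL is a placeholder):

```python
from youtube_dl import YoutubeDL

playlist_url = "https://www.youtube.com/playlist?list=..."  # placeholder

ydl = YoutubeDL({"quiet": True})
info = ydl.extract_info(playlist_url, download=False, process=False)
entry = next(iter(info["entries"]))

# _make_archive_id builds an id like "youtube dQw4w9WgXcQ" from the entry's
# extractor key and video id, with no extra network requests.
archive_id = ydl._make_archive_id(entry)
```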
It is possible to replace `extractor_hash()` with `_make_archive_id()`. Ytcc will simply resync all playlist content on the next update. I'll commit my changes and release a second beta soon.
I've done some testing, and it appears that this approach will work with my scripts in a drop-in way. Since all of the videos from v1 will be from YouTube, it's trivial to reimplement `_make_archive_id()` so that it doesn't require network access. I'm also not sure how to contribute these scripts, since they require v1 to be installed with the v2 code sitting in a different directory (or vice versa, at the price of a trivial change).
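For the YouTube-only v1 data, that reimplementation boils down to something like this (a sketch; it just mirrors the `--download-archive` format shown above):

```python
def make_archive_id(yt_video_id: str) -> str:
    # All v1 entries are YouTube videos, so the extractor part is always "youtube".
    return "youtube " + yt_video_id
```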
I have finished my implementation, but there are some caveats:
As such, I'm not really sure how to submit it for review. It could sit in the `scripts` directory if it were a single file, but it includes five files (a common module, an export script, an import script, a config file, and a migration shell script that runs them all). If you think it would be acceptable to put them in a subdirectory of `scripts`, I'm happy to submit a PR.
@EmRowlands, I'm not sure how to handle it. Is it public somewhere for me to see? Can you maybe push it to a new branch on your fork, in a new subfolder of `scripts/`? Then we can still decide where to put it when we merge it. Maybe we create an orphan branch (`git checkout --orphan ...`).
I have added them in #50
Migrating the entire v1 database schema is probably too much effort. Instead, it might be enough to export subscriptions with v1 and import the OPML export in v2. Problems with this first alternative:
Second alternative: A Python script (preferably without dependencies) that renames the v1 db, exports subscriptions, and re-imports them with v2. This would solve the problems of the first alternative, but is more work.
I'd love to see everyone's thoughts on this.
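For reference, the second alternative could be roughly this small (a sketch; the database path and the ytcc export/import invocations are placeholders and would need to be checked against the real v1 and v2 command-line interfaces):

```python
#!/usr/bin/env python3
"""Sketch of the second alternative.

Exports subscriptions with v1, renames the v1 database out of the way, and
re-imports the subscriptions with v2. The export/import flags used here are
placeholders, not verified ytcc CLI options, and the database path is the
assumed default location.
"""
import subprocess
from pathlib import Path

DB_PATH = Path.home() / ".local/share/ytcc/ytcc.db"      # assumed default location
OPML_PATH = Path.home() / "ytcc-v1-subscriptions.opml"

def export_with_v1() -> None:
    # Run while v1 is still installed (placeholder flag).
    subprocess.run(["ytcc", "--export-to", str(OPML_PATH)], check=True)

def migrate_to_v2() -> None:
    # Run after upgrading to v2: back up the old database so v2 starts with a
    # fresh schema, then re-import the subscriptions (placeholder subcommand).
    if DB_PATH.exists():
        DB_PATH.rename(DB_PATH.with_name("ytcc.db.v1-backup"))
    subprocess.run(["ytcc", "import", str(OPML_PATH)], check=True)
```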