woefe / ytcc

Command line tool to keep track of your favorite playlists on YouTube and many other places.
GNU General Public License v3.0

Improve migration from v1 to v2 #42

Closed woefe closed 4 years ago

woefe commented 4 years ago

Migrating the entire v1 database schema is probably too much effort. Instead, it might be enough to export subscriptions with v1 and import the OPML export in v2. Problems with this first alternative:

Second alternative: a Python script (preferably without dependencies) that renames the v1 database, exports subscriptions, and re-imports them with v2. This would solve the problems of the first alternative, but it is more work.

I'd love to see everyone's thoughts on this.

woefe commented 4 years ago

First approach depends on #43

woefe commented 4 years ago

First approach has been implemented in 1da677b and 0d03bac

woefe commented 4 years ago

I'm going to leave this issue open for now. Hoping to get more feedback.

EmRowlands commented 4 years ago

I am currently working on a (pair of) script(s) that will perform the migration in a somewhat automated manner. It will cover all videos, their watched status, and everything else that is in the version 1 database. The first script is run with version 1 installed and the second with v2, because Python cannot import two modules with the same name simultaneously.

EmRowlands commented 4 years ago

I'm currently trying to work out how to calculate the extractor_hash from a video URL, but I'm not getting very far. Could you offer some help on this? This is, I think, the only thing holding back the entire script.

woefe commented 4 years ago

@EmRowlands Awesome!!

The extractor_hash is calculated as the SHA-256 of youtube-dl's (unprocessed) information extractor output.

Basically, pseudocode for one item:

```python
sha256(YoutubeDL(...).extract_info(..., process=False).entries[0])
```

Relevant lines: https://github.com/woefe/ytcc/blob/d97eebdab0440b04028dadb5ed3f09eaf3eb2edd/ytcc/core.py#L50 https://github.com/woefe/ytcc/blob/d97eebdab0440b04028dadb5ed3f09eaf3eb2edd/ytcc/core.py#L84
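
For anyone trying to reproduce this outside of ytcc, here is a minimal runnable sketch. It assumes the entry dict is serialized deterministically (with stringified values) before hashing; the exact serialization lives in the core.py lines linked above and may differ, and PLAYLIST_URL is a placeholder:

```python
import hashlib
import json

from youtube_dl import YoutubeDL

# Placeholder; any playlist URL that youtube-dl can extract works here.
PLAYLIST_URL = "https://www.youtube.com/playlist?list=..."

ydl = YoutubeDL({"quiet": True})

# process=False returns the raw information extractor output; for a
# playlist, "entries" is a generator of unprocessed entry dicts.
info = ydl.extract_info(PLAYLIST_URL, download=False, process=False)
first_entry = next(iter(info["entries"]))

# Hash a deterministic serialization of the entry. ytcc's actual function
# may serialize differently, so digests are not guaranteed to match.
digest = hashlib.sha256(
    json.dumps(first_entry, sort_keys=True, default=str).encode("utf-8")
).hexdigest()
print(digest)
```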

woefe commented 4 years ago

An important detail here is that the hashed entry comes from a playlist, not from the video page itself. I'm not sure if we can easily reverse it from a yt_video_id, which is probably what we would need when converting from v1.

EmRowlands commented 4 years ago

I knew how it was being created; I just couldn't reproduce it because I didn't have access to the playlist data. I'm also not sure this way of generating the hash makes sense, since a video that appears in multiple playlists will have multiple extractor_hashes (unless that is intentional). I considered suggesting the same method but with the extractor info for the specific video instead; however, that would require fetching the info with youtube-dl for every single video being imported.

Perhaps it would be better to use something like the format provided by --download-archive, which provides strings that look like this:

```
youtube dQw4w9WgXcQ
```

Where the first part is the name of the extractor, and the second is a site-specific string that uniquely identifies a video.

woefe commented 4 years ago

Admittedly, the extractor_hash approach has problems. I actually found cases where it won't work with the current function, which relies on a Dict[str, str]; that is not always what the extractors output, since the values can sometimes be more complex structures.

The hash should be the same for a video that appears in different playlists; at least it was the same for all examples I checked.

I have looked into the --download-archive option again. It uses _make_archive_id to create the id, which can be generated from an unprocessed result and therefore does not require more network requests than the current approach. I think using _make_archive_id is more reliable, because then we rely on existing youtube_dl internals, which should play nicer with the rest of youtube_dl.

It is possible to replace extractor_hash() with _make_archive_id(); ytcc will simply resync all playlist content on the next update. I'll commit my changes and release a second beta soon.
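
A hedged sketch of that replacement, assuming the unprocessed playlist entries carry the id and ie_key/extractor_key fields that YoutubeDL._make_archive_id reads (it is a private youtube-dl method, so treat this as illustrative, not stable API):

```python
from youtube_dl import YoutubeDL

# Placeholder playlist URL, as in the hashing sketch above.
PLAYLIST_URL = "https://www.youtube.com/playlist?list=..."

ydl = YoutubeDL({"quiet": True})
info = ydl.extract_info(PLAYLIST_URL, download=False, process=False)

for entry in info["entries"]:
    # Builds "<extractor> <video id>" from the entry's id and extractor
    # key; no additional network requests are needed.
    archive_id = ydl._make_archive_id(entry)
    print(archive_id)  # e.g. "youtube dQw4w9WgXcQ"
```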

EmRowlands commented 4 years ago

I've done some testing, and it appears that this approach will work with my scripts as a drop-in replacement. Since all of the videos from v1 will be from YouTube, it's trivial to reimplement _make_archive_id() so that it does not require network access. I'm still not sure how to contribute these scripts, since they require v1 to be installed with the v2 code sitting in a different directory (or vice versa, at the cost of a trivial change).
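
For illustration, a minimal sketch of that reimplementation, assuming youtube-dl's "<extractor> <id>" archive convention and that the v1 database stores the raw yt_video_id:

```python
def make_archive_id(yt_video_id: str) -> str:
    # Mirrors youtube-dl's archive id format for the YouTube extractor
    # ("<lower-cased extractor name> <video id>"); no network access is
    # needed because the v1 database already stores the video id.
    return "youtube " + yt_video_id

# Example: make_archive_id("dQw4w9WgXcQ") == "youtube dQw4w9WgXcQ"
```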

EmRowlands commented 4 years ago

I have finished my implementation, but there are some caveats:

As such, I'm not really sure how to submit it for review. It could sit in the scripts directory if it were only a single file, but it consists of five files (a common module, an export script, an import script, a config file, and a migration shell script that runs them all).

If you think it would be acceptable to put them in a subdirectory of scripts, I'm happy to submit a PR.

woefe commented 4 years ago

@EmRowlands, I'm not sure how to handle it. Is it public somewhere for me to see? Can you maybe push it to a new branch on your fork, in a new subfolder of scripts/? Then we can still decide where to put it when we merge it. Maybe we create an orphan branch (git checkout --orphan ...).

EmRowlands commented 4 years ago

I have added them in #50