Reactivating Removed Comics?

webcomics / dosage

dosage is a comic strip downloader and archiver

https://dosage.rocks/

MIT License


Reactivating Removed Comics? #114

Closed DirkReiners closed 4 years ago

DirkReiners commented 6 years ago

There is a (long) list of removed comics in plugins/old.py. Many of them don't have removal reasons, and the default is 'del' (deleted). But some of them are still alive and kicking.

Is there a process to get them back, or can we write modules for them and remove them from old.py?

Thanks!

TobiX commented 6 years ago

I would suggest blaming plugins/old.py and finding the commit that removed the module. That could give you more insight into why the module was removed. Most comics will probably land you on df2048c, which was the commit that introduced the old module. Comics removed before that probably have no reason attached, since I added them in bulk. If the reason is wrong or the comic is alive (again), feel free to create pull requests reviving the module and removing it from the old list.
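For anyone who prefers to script that history search, here is a minimal sketch. It uses `git log -S` (git's "pickaxe" search) instead of `git blame`, because it lists only the commits in which a given string appeared or disappeared. The helper names and the comic name are hypothetical, and the path is simply the one mentioned in this thread; adjust it to wherever old.py lives in your checkout.

```python
import subprocess


def pickaxe_command(needle, path):
    """Build a `git log` command that lists commits which changed how
    often `needle` occurs in `path` (git's -S "pickaxe" option)."""
    return ["git", "log", "--oneline", f"-S{needle}", "--", path]


def commits_touching(needle, path, repo_dir="."):
    """Run the command inside a git checkout and return one line per commit.

    Raises subprocess.CalledProcessError if git exits with an error.
    """
    result = subprocess.run(
        pickaxe_command(needle, path),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `commits_touching("SomeComic", "plugins/old.py")` from inside a clone would print the short hashes and subjects of the commits that added or removed that name in the file ("SomeComic" is a placeholder).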

Efreak commented 6 years ago

Is there any reason not to update them to work with the web archive? Many comics that are 'gone' from the web are archived...

TobiX commented 4 years ago

@Efreak Feel free. See for example my recent commit https://github.com/webcomics/dosage/commit/752525c3e98f36b9376b6aa919e5d0567b8e1be8 which does that for a bunch of comics. Some are archived completely, some might be missing some pages. If many/most pages were missing, I removed the modules instead.

@DirkReiners The reason is often del because I didn't bother to research the specific reason why something was removed. Probably should have used a better description.

PS: I'm closing this since I don't see any specific issue here. Feel free to open new issues for specific comics or reopen this issue if you still have open questions regarding this.

Efreak commented 4 years ago

I'll take a look at it when I start reading webcomics again (rn I'm catching up on webnovels). For the most part, it should be trivial (start with last archived comic, work backwards), but it would require testing each comic to weed out comics like AIAC that seem to be archived, but only have a few pages actually available, and comics that aren't actually in the web archive at all.
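A rough sketch of that "start at the last archived page and work backwards" check, in Python. Everything here is hypothetical: `get_previous` stands in for whatever fetches a page and extracts its "previous" link, and the `min_pages` threshold is an arbitrary assumption, not a value taken from dosage.

```python
def walk_backwards(last_url, get_previous, max_pages=10000):
    """Collect page URLs from newest to oldest.

    `get_previous(url)` must return the URL of the previous page, or
    None when there is no earlier page (or the page is not archived).
    """
    pages = []
    visited = set()
    url = last_url
    # Stop on a missing page, on a loop in the links, or at the page limit.
    while url is not None and url not in visited and len(pages) < max_pages:
        visited.add(url)
        pages.append(url)
        url = get_previous(url)
    return pages


def looks_well_archived(pages, min_pages=20):
    """Crude filter for comics that only have a handful of pages archived."""
    return len(pages) >= min_pages
```

Passing the fetch logic in as a callable keeps the walk itself testable without network access; a dictionary's `.get` method is enough to simulate a small archive.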

If you're looking for something that would work now, my previous solution for similar software was using the hosts file to redirect the dead website to a local nginx, which was configured to prefix all requests with 'https://web.archive.org/web/' in a redirect. I'm not sure if the web archive redirect is only in js, though; if it is then this wouldn't work.

TobiX commented 4 years ago

@Efreak Actually, archive.org is using HTTP redirects: When you access https://web.archive.org/web/http://www.example.com/ you are redirected to https://web.archive.org/web/20200110103234/https://example.com/, but therein lies the first problem: The domains of many older comics which have been around for some years are now occupied by domain squatters, so the latest archive.org copy is also broken. Finding useful snapshots is most of the work when using archive.org for a comic...
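To judge whether a redirect landed on a plausible snapshot, it helps to pull the capture time out of the redirected URL. The sketch below assumes the plain `/web/<14 digits>/<original URL>` form shown in the example above; the function name is hypothetical, and other URL variants are simply reported as not matching.

```python
import re
from datetime import datetime

# Matches the plain snapshot form: /web/<YYYYMMDDhhmmss>/<original URL>
SNAPSHOT_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_snapshot_url(url):
    """Return (capture_time, original_url), or None if `url` is not a
    snapshot URL in the plain 14-digit form."""
    match = SNAPSHOT_RE.match(url)
    if match is None:
        return None
    timestamp, original = match.groups()
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), original
```

A capture time that is years later than the comic's last known update is a hint that the snapshot may show a squatter page rather than the comic.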

Efreak commented 4 years ago

Web archive doesn't require exact dates of archived versions in the URL. You do have to find a date that works, but once you have the most recently archived page for that site, that date should just work for as much of the site they have available. To find an appropriate page/date to use, check https://web.archive.org/web/*/domain.tld; frequently there's a long period of time on older sites between the last good copy and the first archived domain squatter[1]. The size of the circles on the dates is a useful indicator as well; the first archived squatter is likely to have a larger circle (iirc? I might be wrong).

1: possibly this is related to the grace period for renewing domains? Or maybe squatters just used to be slower at grabbing expired domains
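As a small illustration of reusing one known-good date for the whole site, the sketch below builds the calendar URL for a domain and pins arbitrary page URLs to a single timestamp. The helper names are hypothetical, and it relies on the behaviour described above, that the web archive does not need the exact capture date of each page.

```python
WAYBACK = "https://web.archive.org/web"


def calendar_url(domain):
    """URL of the capture calendar, used to pick a date by hand."""
    return f"{WAYBACK}/*/{domain}"


def pinned_url(timestamp, original_url):
    """Prefix `original_url` with one fixed snapshot timestamp.

    `timestamp` is a string of digits such as "20200110103234"; the same
    value is reused for every page of the site.
    """
    if not timestamp.isdigit() or len(timestamp) > 14:
        raise ValueError(f"not a usable timestamp: {timestamp!r}")
    return f"{WAYBACK}/{timestamp}/{original_url}"
```

For example, `pinned_url("20200110103234", "https://example.com/comic/1")` yields a URL for one page of a made-up comic, using the timestamp from the example earlier in this thread.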