webcomics / dosage

dosage is a comic strip downloader and archiver
https://dosage.rocks/
MIT License

KingFeatures (ComicsKingdom.com) changed format #131

Closed littauer closed 5 years ago

littauer commented 5 years ago

King Features (aka ComicsKingdom.com) has changed their format and I’m having trouble getting them set back up.

I don’t know Python or the code base well enough to fix this, but I’ll do the grunt work if someone can point the way. I’d prefer a generic solution like those for Creators or GoComics, but will take what I can get.

There aren’t a lot of comics and they’re mostly antiques (Rex Morgan MD, Judge Parker, The Phantom, Alley Oop) but there are a few that people here follow:

Sally Forth, Sherman’s Lagoon, Hagar the Horrible, Safe Havens, and On the Fastrack all are primarily there now. Kevin and Kell is probably headed that way as well.

The new format:

The URL is of the form https://comicskingdom.com/<strip>/

Prior strips (no more than a week or so back, as far as I can tell) are at: https://comicskingdom.com/<strip>/yyyy-mm-dd

index formats used to vary (mostly Month-dd-yyyy) but are now yyyy-mm-dd
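A minimal sketch of the indexing scheme described above. It assumes old indexes looked like `April-09-2019`; the helper names `to_new_index` and `strip_url` are illustrative and not part of dosage.

```python
from datetime import date, datetime

BASE_URL = "https://comicskingdom.com"  # site root from the post above


def to_new_index(old_index):
    """Convert an assumed old-style 'Month-dd-yyyy' index to 'yyyy-mm-dd'."""
    # %B parses the full month name in the current locale (English by default).
    parsed = datetime.strptime(old_index, "%B-%d-%Y")
    return parsed.strftime("%Y-%m-%d")


def strip_url(slug, day):
    """Build the page URL for one strip (by slug) on one day."""
    return "%s/%s/%s" % (BASE_URL, slug, day.isoformat())
```

For example, `strip_url("sally-forth", date(2019, 4, 20))` builds the page address for that day's Sally Forth strip.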

I can’t find a “previous” link

imageSearch = compile(r' image-url="(https://safr\.kingfeatures\.com/api/img\.php\?e=png&s=c&file=.+)"')

gives 1 image.
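A small sketch of how such a pattern behaves on a page fragment. The fragment, its tag name, and the `file=EXAMPLE` token are made up for illustration. The class posted later in this thread matches `&amp;` rather than `&`, which suggests the attribute is HTML-escaped in the raw page; that is an assumption here, and `[^"]+` is used so the match stops at the closing quote.

```python
import html
import re

IMAGE_SEARCH = re.compile(
    r' image-url="(https://safr\.kingfeatures\.com/api/img\.php'
    r'\?e=png&amp;s=c&amp;file=[^"]+)"'
)

# Hypothetical page fragment, not copied from the real site.
SAMPLE = (
    '<comic-viewer image-url="https://safr.kingfeatures.com/api/img.php'
    '?e=png&amp;s=c&amp;file=EXAMPLE" />'
)


def find_image_urls(page_source):
    """Return every matching image URL with HTML entities decoded."""
    return [html.unescape(url) for url in IMAGE_SEARCH.findall(page_source)]
```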

There is no indication as to the image’s size or type even though Firefox correctly gets the size and type (PNG). This causes trouble on download.
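When the response carries no type or size information, both can still be read from the image bytes. The sketch below checks the PNG signature and reads the width and height from the IHDR chunk; it is one possible workaround, not something dosage is known to do, and `png_dimensions` is a hypothetical helper name.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data):
    """Return (width, height) if data starts like a PNG file, else None."""
    # Layout: signature (8 bytes), chunk length (4), b"IHDR" (4),
    # then width (4) and height (4) as big-endian unsigned integers.
    if len(data) < 24 or not data.startswith(PNG_SIGNATURE):
        return None
    if data[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

Only the first 24 bytes of the download are needed, so the check can run before the file is written.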

Sample slugs: safe-havens, on-the-fastrack, sherman-s-lagoon, sally-forth, hagar-the-horrible, rex-morgan-md

Thanks for any help,

Tom

andrew-healey commented 5 years ago

The "previous" link cannot be found with the regex in your post: that one finds the URL of the comic image, not the link to the previous day. The "previous" link is in the page source with only the date (yyyy-mm-dd). For example, from Barney Google and Snuffy Smith on April 21, 2019:

<slider-arrow inline-template :is-left-arrow="true" feature-slug="barney-google-and-snuffy-smith" date-slug="2019-04-20">

I hope this answers your question of how to find the link to the previous day's comic.
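As a rough sketch, the slug and date can be pulled out of that tag with a regular expression. This assumes the attributes appear in the same order as in the sample; `PREV_ARROW` and `previous_strip` are illustrative names.

```python
import re

# Sample tag quoted in the comment above.
SAMPLE = (
    '<slider-arrow inline-template :is-left-arrow="true" '
    'feature-slug="barney-google-and-snuffy-smith" date-slug="2019-04-20">'
)

# [^>]*? keeps the match inside a single tag.
PREV_ARROW = re.compile(
    r':is-left-arrow="true"[^>]*?feature-slug="([^"]+)"'
    r'[^>]*?date-slug="(\d{4}-\d{2}-\d{2})"'
)


def previous_strip(page_source):
    """Return (slug, date) from the left-arrow tag, or None if it is absent."""
    match = PREV_ARROW.search(page_source)
    return match.groups() if match else None
```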

EDIT:

Is ComicsKingdom actually in the plugins folder at all?

littauer commented 5 years ago

Thanks for the "previous" link solution, I'd missed it and it will be helpful.

The big problem is that the image pointed to by the imageSearch does not have a Content-Length header and therefore gets written with zero length. It also doesn't have an assigned type but I can jam that to .png.
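The thread does not show where dosage consults Content-Length, so the following is only a generic sketch of the alternative: write whatever bytes arrive and count them, instead of trusting the header. `save_chunks` is a hypothetical helper.

```python
import io


def save_chunks(chunks, out_file):
    """Write every non-empty chunk to out_file and return the total byte count."""
    total = 0
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            out_file.write(chunk)
            total += len(chunk)
    return total


# Stand-in for a streamed response body. With the requests library this could
# be response.iter_content(chunk_size=8192) on a request made with stream=True.
fake_body = [b"\x89PNG", b"", b"rest-of-image"]
buffer = io.BytesIO()
written = save_chunks(fake_body, buffer)
```

The returned count can then stand in for the missing header when deciding whether the download was empty.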

You're correct, ComicsKingdom isn't in the plugins or scripts folders; that's what I'm trying to fix.

EDIT: Doesn't prevSearch assume that it can return a URL? I can return a strip name and date from what you pointed me to but the base URL (https://comicskingdom.com/) isn't there to return.
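One way to turn a bare date into a full link, sketched with the standard library rather than any dosage hook: resolve it against the strip's base URL. The trailing slash on the base matters, and `absolute_link` is an illustrative name.

```python
from urllib.parse import urljoin

BASE = "https://comicskingdom.com/on-the-fastrack/"  # note the trailing slash


def absolute_link(base_url, target):
    """Resolve a bare date (or an already absolute URL) against base_url."""
    return urljoin(base_url, target)
```

An absolute target is returned unchanged, so the same call is safe whether the scraped value is a date or a complete URL.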

andrew-healey commented 5 years ago

Given that shortcoming, I'm guessing a pull request is needed; it should mostly be a matter of find-and-replace with regex.

littauer commented 5 years ago

The following pattern seems to work for immediate scraping, but be aware that you can only go back about 7 days. I'm using OnTheFastrack as an example.

DRAT! how do you paste python code here? The indents are being eaten!

# Imports assumed to mirror the other modules in dosagelib/plugins.
from datetime import date
from re import compile

from ..scraper import _BasicScraper


class OnTheFastrack(_BasicScraper):
    # King Features seems to have changed format on 4/09/2019
    url = 'https://comicskingdom.com/on-the-fastrack/'
    stripUrl = url + '%s'
    firstStripUrl = stripUrl % '2000-11-13'
    # Matches '&amp;' on the assumption that the page source HTML-escapes '&'.
    imageSearch = compile(r' image-url="(https://safr\.kingfeatures\.com/api/img\.php\?e=png&amp;s=c&amp;file=[^"]+)"')
    # Captures only the date of the previous strip, not a full URL.
    prevSearch = compile(r' :is-left-arrow="true" .*date-slug="(\d\d\d\d-\d\d-\d\d)"')
    help = 'Index format: yyyy-mm-dd'

    def namer(self, image_url, page_url):
        # page_url is url + date, or just url (empty date) for today's strip.
        parts = page_url.rsplit('/', 3)
        name = parts[2]
        day = parts[3]
        if day == "":
            day = date.today().strftime("%Y-%m-%d")
        return "%s_%s.png" % (name.title(), day)

    def link_modifier(self, url, tourl):
        # prevSearch yields a bare date, so prepend the strip's base URL.
        if not tourl.startswith(self.url):
            tourl = self.url + tourl
        return tourl

littauer commented 5 years ago

If you're desperate for ComicsKingdom.com strips, put comicskingdom.py in your dosagelib/plugins directory and reference the strips as ComicsKingdom/<strip>, as listed in that file.

A few don't work (like Tiger vs TigerSundays) due to an issue in scripts/scraper.py.

I'll be asking about that in a separate issue.

You can find comicskingdom.py at:

https://github.com/littauer/dosagetest/blob/master/dosagelib/plugins/comicskingdom.py

littauer commented 5 years ago

Closed as fixed in pull request #134