webcomics / dosage

dosage is a comic strip downloader and archiver
https://dosage.rocks/
MIT License

Many deleted comics due to ComicsKingdom butchering #132

Closed andrew-healey closed 4 years ago

andrew-healey commented 5 years ago

ComicsKingdom now redirects the websites of Sherman's Lagoon, Dennis the Menace, and many other popular, old, and generally good comics to the Comics Kingdom website, so Sherman's Lagoon and those other comic sources are now broken in dosage. The only choices are to switch to an alternative website that will probably stay up for the foreseeable future (for example, Hagar the Horrible has a privately owned hagarthehorrible.net (or hagardunor.*) site in addition to the ComicsKingdom-owned hagarthehorrible.com), or to scrape Comics Kingdom's own site.

littauer commented 5 years ago

Another choice is to scrape the new ComicsKingdom.com website. I have individual comics working and am working on generic scraping.

littauer commented 5 years ago

Fixed in pull request #134

dabolter commented 5 years ago

I'm new to GitHub, so I'm not certain I did this correctly. I tried to grab the changed files and it appeared to work, but dosage then tries to download something and gets an SSLError.

I did have a Perl script that successfully downloaded images from ComicsKingdom; it worked on Wednesday but failed on Thursday. So maybe something changed a few days ago, or possibly I stuffed something up.

C:\dosage>dosage ComicsKingdom/PhantomS
ComicsKingdom/PhantomSundays> Retrieving 1 strip
ComicsKingdom/PhantomSundays> ERROR: URL retrieval of https://comicskingdom.com/phantom-sundays failed: HTTPSConnectionPool(host='comicskingdom.com', port=443): Max retries exceeded with url: /phantom-sundays (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))

Also, the main thing I am after is the various Phantoms. It fails on the daily, as there are multiple patterns matching Phantom. Looking at the script, it scrapes the list but is not getting the vintage Phantoms. If I alter the script it generates, will that stick? Or does that need to be done somewhere else?

I have some familiarity with Python, so I could have a go myself, but I have no idea where to start or how the whole project ties together.

littauer commented 5 years ago

The SSL problem is new (Thursday seems right). ComicsKingdom now appears to be looking for an SSL client certificate, or it won't accept the connection.

This doesn't seem to be a problem for browsers, but it is for Python's Requests library and maybe for curl or wget.

As for Phantom, are you using my version of dosage? See pull request #134.

Phantom was one of the ones I used as a test; did you try ComicsKingdom/phantom? Not that it matters until the SSL issue gets fixed. If you find a fix before I do, please update.

The documentation could be better (ain't that always true?) but the package itself is pretty well designed. If you're comfortable with Python objects it should go easily.

littauer commented 5 years ago

Update: wget and curl fail also, but wget says:

Unable to locally verify the issuer's authority. To connect to comicskingdom.com insecurely, use `--no-check-certificate'.

If you do that, it'll work.

This makes me think a client cert is not what's needed; working around a bad TLS configuration is.

Looking into "chained certificates" next.

If insecure wgets work for you, go for it.

dabolter commented 5 years ago

My original Perl script basically generated a phantomjs command to handle the JavaScript and then executed it. Adding an ignore-SSL-errors flag to the command now allows it to work.

It would then parse the generated HTML, try to extract the image, and download it, which I think was done with wget.

However, dosage is a far better package than my old script running on an old Mac. Where do I need to make the change to add `--no-check-certificate'?

Also, regarding the Phantom: if I try dosage ComicsKingdom/PhantomS I get the Sunday strip, but if I try dosage ComicsKingdom/Phantom I get the error about multiple matches. Is there a wildcard I can add to get both?
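The behaviour described here (a longer prefix selects one comic, a shorter one reports multiple matches) is what case-insensitive substring matching would produce. The following is only a sketch of that idea, not dosage's actual lookup code; the function name and the second comic name (PhantomVintage) are hypothetical.

```python
def find_matches(query, names):
    """Return the names selected by a case-insensitive substring query.

    An exact (case-insensitive) match wins outright; otherwise every name
    that contains the query is returned as a candidate.
    """
    wanted = query.lower()
    exact = [name for name in names if name.lower() == wanted]
    if exact:
        return exact
    return [name for name in names if wanted in name.lower()]


# Hypothetical name list; only PhantomSundays is taken from this thread.
names = ["ComicsKingdom/PhantomSundays", "ComicsKingdom/PhantomVintage"]

# "PhantomS" is specific enough to select a single comic.
print(find_matches("ComicsKingdom/PhantomS", names))

# "Phantom" is a substring of both names, so the query is ambiguous.
print(find_matches("ComicsKingdom/Phantom", names))
```

Under this assumption, the fix for an ambiguous query is a name that exists exactly, or a longer prefix, rather than a wildcard.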

littauer commented 5 years ago

dosage doesn't use wget but rather Python's Requests library. I was hoping you could at least use your former script to get comics while I worked, and it seems you can.

The comicskingdom.com SSL certificate is badly formed in that it is not complete. It relies on the client (dosage in this case) to chase through the chain of "Authority Information Access" URLs to gather a full set or bundle of certificates all the way from the root authority to comicskingdom. Some clients (e.g. Internet Explorer) do this, some do not (e.g. wget).

I'm working on chasing the chain and then teaching dosage how to use the fixed certificate. It's slow going.

When I can access comicskingdom.com again I'll address Phantom.

littauer commented 5 years ago

To get past the SSL error message, you need to give dosage a complete and correct certificate bundle by pointing to it with the REQUESTS_CA_BUNDLE environment variable before you execute dosage.

The bundle is a single file made up by concatenating three certificates IN THIS ORDER:

1. The root GoDaddy.com CA certificate (I used gdroot-g2.crt)
2. The GoDaddy Intermediate certificate (I used gdig2.crt.pem)
3. The ComicsKingdom.com certificate

GoDaddy certs are available at https://ssl-ccp.godaddy.com/repository?origin=CALLISTO

Getting the ComicsKingdom cert is easy enough but is platform dependent. I've attached the one I use, but you shouldn't trust it: ckcert1.zip (https://github.com/webcomics/dosage/files/3298659/ckcert1.zip)
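For anyone who would rather script this step, here is a minimal sketch using only the Python standard library. It assumes all three certificates are PEM-encoded text files (a DER-encoded download would need converting first); the function names and the output file names are made up for illustration.

```python
import ssl
from pathlib import Path

PEM_HEADER = "-----BEGIN CERTIFICATE-----"


def fetch_server_cert(host, port=443):
    """Fetch the server's own (leaf) certificate as PEM text.

    No validation is performed here, so treat the result with the same
    caution as any certificate obtained over an unverified connection.
    """
    return ssl.get_server_certificate((host, port))


def build_ca_bundle(cert_paths, bundle_path):
    """Concatenate PEM certificate files, in the order given, into one file."""
    parts = []
    for path in cert_paths:
        text = Path(path).read_text(encoding="ascii").strip()
        if PEM_HEADER not in text:
            raise ValueError(f"{path} does not look like a PEM certificate")
        parts.append(text)
    bundle = Path(bundle_path)
    bundle.write_text("\n".join(parts) + "\n", encoding="ascii")
    return bundle
```

A possible use: write the result of fetch_server_cert("comicskingdom.com") to a file such as comicskingdom.pem, then call build_ca_bundle(["gdroot-g2.crt", "gdig2.crt.pem", "comicskingdom.pem"], "ck-bundle.pem") to keep the order given above.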

ComicsKingdom also changed the place they kept the image file; I've updated the pull request to reflect the change.

dosage ComicsKingdom/Phantom and dosage ComicsKingdom/PhantomSundays work fine for me.

dabolter commented 5 years ago

dosage ComicsKingdom/Phantom still fails for me; however, the other bits are now working.

I extracted the ComicsKingdom certificate myself and downloaded the others you mentioned. Setting the REQUESTS_CA_BUNDLE environment variable and pointing it to the dosage folder failed, but blanking the variable again was enough to get it working.

Possibly putting them in the root of the dosage folder was enough.

As above, the Phantom remains a problem; however, if I create the folder ComicsKingdom\Phantom, then dosage @ will fetch it.

So I have another workaround.

My dosage is based on the standard one, with the 4 files mentioned in your changes manually inserted. Is that enough? Could some of my problems be resolved by refreshing the install with your full set? If so, is there a git command or something I can do? I am running this on Windows 10, if that makes a difference.

Also, is there any way of getting the other Vintage Phantom? https://www.comicskingdom.com/phantom-1

There are a few others that are duplicated in the Vintage area with the same issue, e.g. Flash Gordon.


littauer commented 5 years ago

REQUESTS_CA_BUNDLE needs to point to the concatenated certificate file itself, not to a folder, to be effective. I think setting it to null forces use of the local system's certificate store. Not setting it uses Requests' default setting. Only the first worked for me; I'm glad another choice is working for you.

I see that you're using Windows while I use Linux (an openSUSE variant). One major difference is that case is significant in Linux and not in Windows... you might look into that. Also watch out for \ vs. /
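A small sketch of setting the variable for a single dosage run, with a check that the path really is a file. The helper name and the bundle file name are hypothetical; only the REQUESTS_CA_BUNDLE variable itself comes from this thread.

```python
import os
from pathlib import Path


def env_with_ca_bundle(bundle_path, base_env=None):
    """Return a copy of the environment with REQUESTS_CA_BUNDLE set.

    The path is made absolute so it still resolves if the working
    directory changes, and it is checked to be a file because this
    workflow expects the concatenated bundle, not the folder holding it.
    """
    bundle = Path(bundle_path).resolve()
    if not bundle.is_file():
        raise FileNotFoundError(f"{bundle} is not a file; expected the concatenated bundle")
    env = dict(os.environ if base_env is None else base_env)
    env["REQUESTS_CA_BUNDLE"] = str(bundle)
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run, for example subprocess.run(["dosage", "ComicsKingdom/PhantomSundays"], env=env_with_ca_bundle("ck-bundle.pem")), which leaves the variable untouched for everything else on the machine.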

Using git isn't magic; anything other than dosage/dosagelib/plugins/comicskingdom.py is installation oriented. dosage/scripts/comicskingdom.py is intended to discover what comics comicskingdom is providing but clearly the script missed some. Others miss as well. You can add your own manually; look at gocomics.py (both in scripts and plugins) as an example.

It looks like the link for prior comics is changed as well; I need to look into that.

Good luck! I'll help if I can but I have other projects going. I am not a dosage maintainer; just an open source fan.

TobiX commented 5 years ago

The SSL problem is an incomplete certificate chain, a stupid server misconfiguration that people don't notice because it works in the browser (see SSLLabs). The simplest way to get this working would be to add an insecure flag to the comic module, which would then be passed to requests. That avoids mucking around with certificates and can easily be dropped whenever the webmaster grows a clue...
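What such an insecure flag amounts to can be sketched with the standard library alone. This is not dosage code, and the function names are hypothetical; with Requests, the corresponding switch is passing verify=False to a request or setting it on a session.

```python
import ssl
import urllib.request


def make_ssl_context(insecure=False):
    """Build a TLS context; with insecure=True, skip certificate checks."""
    context = ssl.create_default_context()
    if insecure:
        # Hostname checking has to be switched off before verify_mode
        # can be set to CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def fetch(url, insecure=False, timeout=30):
    """Download a URL, optionally ignoring certificate errors."""
    context = make_ssl_context(insecure)
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()
```

Keeping the flag per site rather than global limits the loss of verification to the one host with the broken chain, and makes it easy to remove once the server is fixed.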

TobiX commented 5 years ago

And the SSL issues seem to be fixed today :+1:

TobiX commented 4 years ago

Merged #134, so this should be "fixed" for now. Maybe someone wants to move the broken ones to https://github.com/webcomics/dosage/blob/master/dosagelib/plugins/old.py :wink: