Open ron-wolf opened 5 years ago
I am apparently affected by a similar issue with the Linux CLI. In various attempts to dump a website, it hangs forever after printing this last message:

```
https://www.w3.org/2018/json-ld-wg/ (9987 bytes) - OK
```

The command I use is `httrack --continue --ext-depth=1 "https://XXXXXXXXXX/"`.

Cheers!
When I press Ctrl+C to stop it, I get this message:

```
** Finishing pending transfers.. press again ^C to quit.
```

Unless I press Ctrl+C again, it stays like that forever.

This is what I found in the last lines of the log:
```
$ tail hts-log.txt
17:03:03 Warning: engine: warning: temporary file en.wikipedia.org/wiki/Internet_of_things.html.tmp already exists
17:03:04 Warning: engine: warning: temporary file en.wikipedia.org/w/indexa116.html.tmp already exists
17:03:05 Warning: engine: warning: temporary file en.wikipedia.org/wiki/Hypertext_Application_Language.html.tmp already exists
17:03:06 Warning: engine: warning: temporary file en.wikipedia.org/w/indexbfd3.html.tmp already exists
17:03:09 Warning: engine: warning: temporary file json-ld.org/spec/latest/json-ld-syntax/index.html.tmp already exists
17:03:23 Warning: engine: warning: temporary file www.w3.org/2011/rdf-wg/wiki/Main_Page.html.tmp already exists
17:04:17 Warning: engine: warning: temporary file www.w3.org/TR/2014/REC-json-ld-20140116/index.html.tmp already exists
17:04:17 Warning: engine: warning: temporary file json-ld.org/index.html.tmp already exists
17:04:17 Warning: engine: warning: temporary file www.w3.org/2018/json-ld-wg/index.html.tmp already exists
02:28:13 Error: Exit requested by shell or user
```
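One thing I have not verified, but may be worth trying given all the `temporary file ... already exists` warnings: clear the leftover `.tmp` files from the interrupted run before resuming, so `--continue` does not trip over them. A minimal sketch, assuming the mirror was created in the current directory:

```sh
# Sketch of a possible workaround (assumption, not a confirmed fix):
# delete leftover temporary files from the aborted run...
find . -name '*.html.tmp' -type f -delete

# ...then resume the mirror with the same command as before.
httrack --continue --ext-depth=1 "https://XXXXXXXXXX/"
```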
I installed the CLI for `httrack` using Homebrew. Many of the sites I try to mirror contain some `xmlns:` link attributes in their main `<html>` tags (e.g., `xmlns:og=`, `xmlns:rdfs=`). For these sites, HTTrack spends all its time crawling those format specification URIs rather than the site itself, and those requests always take forever, which is surprising considering most of these files are mere kilobytes in size.

I've tried more options than I can recall, including `--priority=1`, `--priority=7`, `--can-go-down`, and `--stay-on-same-address`, but nothing works. I'm not sure what MIME type these links are, but if I knew, I could try `--disable-module`. Regardless, I'm not sure why HTTrack ignores domain restrictions for these particular URIs. The example URI for reproduction is `ethnologue.com/21`, and the problem persists regardless of whether its `robots.txt` is respected.
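For what it's worth, two things that could be tried from the shell (a rough sketch, not a confirmed fix): check what MIME type one of those namespace URIs is actually served with, and then exclude the namespace hosts explicitly with HTTrack scan rules (the trailing `+`/`-` filter patterns), so the crawler stays on the target domain. The specific URI and the hosts excluded below are assumptions based on the logs above, not values from the original report:

```sh
# Inspect the Content-Type header of a namespace URI
# (hypothetical example URI, standing in for whatever the xmlns: attributes point to).
curl -sI "https://www.w3.org/1999/xhtml" | grep -i '^content-type'

# Mirror the example site while excluding the namespace hosts via scan rules.
# The +/- patterns are HTTrack filters; the hosts listed here are an assumption.
httrack "https://ethnologue.com/21/" \
  "+*.ethnologue.com/*" \
  "-www.w3.org/*" "-json-ld.org/*"
```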