Open ron-wolf opened 5 years ago
I am apparently affected by a similar issue with the Linux CLI. In various attempts to dump a website, it hangs forever after printing this last message:

```
https://www.w3.org/2018/json-ld-wg/ (9987 bytes) - OK
```

The command I use is `httrack --continue --ext-depth=1 "https://XXXXXXXXXX/"`.

Cheers!
When I press Ctrl+C to stop it, I get this message:

```
** Finishing pending transfers.. press again ^C to quit.
```

Unless I press Ctrl+C again, it stays like that forever.

This is what I found in the last lines of the log:
```
$ tail hts-log.txt
17:03:03 Warning: engine: warning: temporary file en.wikipedia.org/wiki/Internet_of_things.html.tmp already exists
17:03:04 Warning: engine: warning: temporary file en.wikipedia.org/w/indexa116.html.tmp already exists
17:03:05 Warning: engine: warning: temporary file en.wikipedia.org/wiki/Hypertext_Application_Language.html.tmp already exists
17:03:06 Warning: engine: warning: temporary file en.wikipedia.org/w/indexbfd3.html.tmp already exists
17:03:09 Warning: engine: warning: temporary file json-ld.org/spec/latest/json-ld-syntax/index.html.tmp already exists
17:03:23 Warning: engine: warning: temporary file www.w3.org/2011/rdf-wg/wiki/Main_Page.html.tmp already exists
17:04:17 Warning: engine: warning: temporary file www.w3.org/TR/2014/REC-json-ld-20140116/index.html.tmp already exists
17:04:17 Warning: engine: warning: temporary file json-ld.org/index.html.tmp already exists
17:04:17 Warning: engine: warning: temporary file www.w3.org/2018/json-ld-wg/index.html.tmp already exists
02:28:13 Error: Exit requested by shell or user
```
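One thing I have not verified, but may be worth trying given all the `temporary file ... already exists` warnings: clear the leftover `.tmp` files from the interrupted run before resuming, so `--continue` does not trip over them. A minimal sketch, assuming the mirror was created in the current directory:

```sh
# Sketch of a possible workaround (assumption, not a confirmed fix):
# delete leftover temporary files from the aborted run...
find . -name '*.html.tmp' -type f -delete

# ...then resume the mirror with the same command as before.
httrack --continue --ext-depth=1 "https://XXXXXXXXXX/"
```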
I installed the CLI for `httrack` using Homebrew. Many of the sites I try to mirror contain some `xmlns:` link attributes in their main `<html>` tags (e.g., `xmlns:og=`, `xmlns:rdfs=`). For these sites, HTTrack spends all its time crawling those format specification URIs rather than the site itself, and those requests always take forever, which is surprising considering most of these files are mere kilobytes in size.

I've tried more options than I can recall, including `--priority=1`, `--priority=7`, `--can-go-down`, and `--stay-on-same-address`, but nothing works. I'm not sure what MIME type these links are, but if I knew, I could try `--disable-module`. Regardless, I'm not sure why HTTrack ignores domain restrictions for these particular URIs. The example URI for reproduction is `ethnologue.com/21`, and the problem persists regardless of whether its `robots.txt` is respected.
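For what it's worth, two things that could be tried from the shell (a rough sketch, not a confirmed fix): check what MIME type one of those namespace URIs is actually served with, and then exclude the namespace hosts explicitly with HTTrack scan rules (the trailing `+`/`-` filter patterns), so the crawler stays on the target domain. The specific URI and the hosts excluded below are assumptions based on the logs above, not values from the original report:

```sh
# Inspect the Content-Type header of a namespace URI
# (hypothetical example URI, standing in for whatever the xmlns: attributes point to).
curl -sI "https://www.w3.org/1999/xhtml" | grep -i '^content-type'

# Mirror the example site while excluding the namespace hosts via scan rules.
# The +/- patterns are HTTrack filters; the hosts listed here are an assumption.
httrack "https://ethnologue.com/21/" \
  "+*.ethnologue.com/*" \
  "-www.w3.org/*" "-json-ld.org/*"
```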