xroche / httrack

HTTrack Website Copier, copy websites to your computer (Official repository)
http://www.httrack.com/

Consider separating out the various parts of *URL Hacks* into separate options #271

Open AnyOldName3 opened 8 months ago

AnyOldName3 commented 8 months ago

A site mirror I made didn't work properly until I commented out the guts of jump_normalized_const so that it no longer jumped over a www. prefix; after that, it worked. (This is oversimplifying: I ended up needing to use https://github.com/mitchcapper/httrack so the project would build, and then had to patch out a couple of regressions that fork had versus this version.)

If the options to treat http:// and https:// URLs as the same, to treat www.thing.com and thing.com URLs as the same, and to remove redundant slashes were separate instead of sitting under one umbrella URL Hacks setting, I could have enabled just the bits I needed and disabled the rest.
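To make the request concrete, here is a rough sketch of what three independent switches could look like. It is written in Python for brevity rather than HTTrack's C, and every name in it (`UrlHackOptions`, `merge_schemes`, `merge_www`, `collapse_slashes`, `normalize`) is hypothetical; none of them are existing HTTrack options or functions. It only builds a comparison key, so the fragment is dropped.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class UrlHackOptions:
    # Hypothetical flags, one per behaviour currently bundled under "URL Hacks".
    merge_schemes: bool = True     # treat http:// and https:// as the same
    merge_www: bool = True         # treat www.thing.com and thing.com as the same
    collapse_slashes: bool = True  # remove redundant slashes in the path


def normalize(url: str, opts: UrlHackOptions) -> str:
    """Return a comparison key for `url`, applying only the enabled hacks."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    path = parts.path
    if opts.merge_schemes and scheme == "https":
        scheme = "http"
    if opts.merge_www and host.startswith("www."):
        host = host[len("www."):]
    if opts.collapse_slashes:
        path = re.sub(r"/{2,}", "/", path)
    query = f"?{parts.query}" if parts.query else ""
    return f"{scheme}://{host}{path}{query}"
```

With something like this, my case would have been `UrlHackOptions(merge_www=False)`: the scheme and slash handling stay on, and only the www. equivalence is switched off.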

AnyOldName3 commented 8 months ago

I determined that the particular site could in principle have worked with both the http:// / https:// equivalence and the www.domain / domain equivalence, but the system that detects when one URL redirects to another failed when the redirect took more than one step. The http:// URLs redirected to the https:// URLs, which HTTrack handled sensibly, but then the non-www. URLs redirected to the www. ones, which HTTrack didn't bother fetching. My guess is that this was misinterpreted as a redirect loop because the URLs were identical after normalisation, even though they differed before it.
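A toy illustration of that guess, in Python rather than HTTrack's C. This is not HTTrack's actual redirect logic, and `comparison_key`, `resolve`, and the `chain` table are made up for the example. It shows that all three URLs in the chain collapse to one key after normalisation, so any loop check keyed on the normalised form cannot tell the hops apart, while a check on the raw URLs follows the chain to the end.

```python
def comparison_key(url: str) -> str:
    """Toy normalisation: drop the scheme and a leading www. (hypothetical)."""
    rest = url.split("://", 1)[-1]
    return rest[4:] if rest.startswith("www.") else rest


def resolve(start: str, redirects: dict, max_hops: int = 10) -> str:
    """Follow redirects from `start`, detecting loops on the raw URLs."""
    seen = {start}
    url = start
    for _ in range(max_hops):
        target = redirects.get(url)
        if target is None:
            return url  # no further redirect: this is the page to fetch
        if target in seen:
            raise RuntimeError(f"redirect loop at {target}")
        seen.add(target)
        url = target
    raise RuntimeError("too many redirects")


# Stand-in for the site's two-step redirect chain (thing.com is a placeholder).
chain = {
    "http://thing.com/": "https://thing.com/",
    "https://thing.com/": "https://www.thing.com/",
}

final_url = resolve("http://thing.com/", chain)
keys = {comparison_key(u) for u in ("http://thing.com/", *chain.values())}
```

Here `final_url` ends up as the https:// www. URL, while `keys` contains a single entry, which is the situation where a normalised comparison would see the second hop as pointing back at a URL it already knows.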