xroche / httrack

HTTrack Website Copier, copy websites to your computer (Official repository)
http://www.httrack.com/

Serious bottleneck in download speed #175

Closed rchl closed 9 months ago

rchl commented 6 years ago

Download speeds are seriously bottlenecked somewhere in the code. Using a command like:

httrack URL -O "test" -%v --disable-security-limits --max-rate=0

and downloading, say, 4 big video files at the same time, I get around 400KB/s total download speed, while manually downloading those same files (with wget or a browser) I can easily get over 10MB/s.

It's not due to the network or the fact that 4 files are being downloaded at the same time. Why do I think so? Because canceling the mirror with ctrl+c fixes download speeds as soon as one of the files completes.
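For comparison, fetching the same files in parallel with plain wget (the URLs below are placeholders, not my actual test files) shows what the link can sustain outside of httrack:

# Fetch 4 big files in parallel to measure raw link throughput.
for i in 1 2 3 4; do
  wget -O "file$i.zip" "https://example.com/big$i.zip" &
done
wait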

Steps:

  1. Make httrack start downloading 4 big files at the same time.
  2. Press ctrl+c to queue termination
  3. Wait for one of the files to complete downloading.

What happens: All 3 remaining files suddenly get "super speed" and finish quickly.

That makes me think there is some very socket/CPU-intensive code running while files are being downloaded that drags download rates down. When termination is triggered (I assume through opt->state.stop handling), that code probably no longer runs and download speeds are no longer bottlenecked.
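One way to sanity-check that theory (assuming a Linux host and a single httrack process; these tools are just a suggestion, not something httrack provides) is to watch the process while the mirror sits in the throttled state:

# If internal bookkeeping rather than the network is the limit,
# CPU usage should stay high while the transfer rate stays low.
top -p "$(pgrep -x -d, httrack)"

# A syscall summary (printed when strace exits) can also reveal busy
# polling on the sockets: huge select/poll counts with tiny recv reads.
strace -c -p "$(pgrep -x httrack)"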

rchl commented 6 years ago

(I don't have any public URL to test with unfortunately)

dustmoo commented 6 years ago

@rchl From what I understand, if you queue a termination the other files are not downloaded. The logs tell you this is because of the termination.

rchl commented 6 years ago

Yes but that is irrelevant to the issue. It's the files that are already downloading that get their speed boosted.

rchl commented 6 years ago

I've set up a test page with a couple of links to big (200MB) files. To test, you can run:

httrack https://c8c3dbef-cee9-4765-b7f7-21b5e2ed6d1f.htmlpasta.com/ -O "test-download" -%v --disable-security-limits --max-rate=0 -%e1

The results I see when downloading, before triggering cancel, show around 2MB/s:

Bytes saved:    21,10MiB           Links scanned:   2/7 (+1)
Time:   16s                        Files written:   2
Transfer rate:  1,87MiB/s (1,31MiB/s)  Files updated:   3
Active connections:     4          Errors:  0

Current job: waiting (throttle)
 receive -  ipv4.download.thinkbroadband.com/200MB.zip?1    8,16MiB /   200,00MiB
 receive -  ipv4.download.thinkbroadband.com/200MB.zip?3    6,01MiB /   200,00MiB
 receive -  ipv4.download.thinkbroadband.com/200MB.zip?2    6,92MiB /   200,00MiB
 request -  https://www.google-analytics.com/analytics.js   121B /  8,00KiB

After cancelling (ctrl+c), the speed increases to 5MB/s for the files that are already downloading:

Bytes saved:    90,48MiB           Links scanned:   5/11 (+0)
Time:   31s                        Files written:   3
Transfer rate:  4,90MiB/s (2,91MiB/s)  Files updated:   3
Active connections:     3          Errors:  1

Current job: receiving files
 receive -  ipv4.download.thinkbroadband.com/200MB.zip?1    37,47MiB /  200,00MiB
 receive -  ipv4.download.thinkbroadband.com/200MB.zip?3    29,44MiB /  200,00MiB
 receive -  ipv4.download.thinkbroadband.com/200MB.zip?2    23,53MiB /  200,00MiB

To me it seems like there is some queue-handling code that bottlenecks download speeds.

rchl commented 6 years ago

NOTE: After canceling the transfer, you need to wait for at least one file to finish downloading before the boost happens. Only then does the Current job status change from waiting (throttle) to receiving files and the transfer speeds increase.

llewlem888 commented 5 years ago

@rchl I'm experiencing the same thing. Painfully slow through httrack, about 30KiB/s per socket/open connection, but when I download a video file through uget it's about 3-5MiB/s. Very frustrating. Just downloading the plain HTML is also painfully slow, and I think more of the program's runtime is spent in "waiting (throttle)" mode than actually downloading.

I run httrack 3.49-2 on CLI in bash under XFCE on top of OpenSUSE Leap 15.

godbout commented 5 years ago

I'm having issues too. I'm in HK and my own VPS is in Singapore. Pages load in 20ms. I get around 1KiB/s with the following:

"httrack 'https://{$docset->url()}' \
--path 'storage/{$docset->code()}' \
--connection-per-second=50 \
--sockets=80 \
--keep-alive \
--display \
--verbose \
--advanced-progressinfo \
--disable-security-limits \
-s0 \
-o0 \
-F 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' \
--max-rate=0 \
--depth=5"

clutchbeyers commented 4 years ago

Just use wget:

This example uses the Chrome browser on a Mac.

If you are trying to download content that is only viewable when authenticated and authorized, you need to log in to the site in your browser first so that the session data can be referenced.

wget -r -l inf --convert-links --page-requisites --adjust-extension --load-cookies "/Users/user/Library/Application Support/Google/Chrome/Default/Cookies" **<http/https link>**