xroche / httrack

HTTrack Website Copier, copy websites to your computer (Official repository)
http://www.httrack.com/

Question | What program can I use if I only want a list of files/links in a text file? #259

Closed Stamimail closed 8 months ago

Stamimail commented 1 year ago

Instead of a mirror of a website, I want a mirror of the structure of the website: to know the existence and location of the files (URLs) without downloading the files themselves.

Alternatively, instead of a list of files in a text file, the program could make a mirror just like HTTrack does, but instead of downloading the files from the website, it would create all the folders and files on my computer as empty 0-byte files, named like the original files on the website.
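A minimal shell sketch of that second idea, assuming a plain-text URL list is already available (urls.txt is a hypothetical name, one absolute URL per line):

    # Recreate the site's folder structure locally as empty 0-byte placeholder files.
    # Assumes urls.txt has one absolute URL per line; bare directory URLs ending in "/" are skipped.
    while IFS= read -r url; do
        path="${url#*://}"                   # strip the http:// or https:// scheme
        case "$path" in */) continue ;; esac # skip directory URLs with no filename
        mkdir -p "$(dirname "$path")"        # recreate the folder hierarchy
        : > "$path"                          # create an empty placeholder file
    done < urls.txt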

According to my limited research, the programmer should use the "HEAD" request instead of the "GET" request for this purpose.
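For example, a HEAD request returns only the response headers (status code, Content-Type, Content-Length), so a URL's existence and size can be checked without transferring the body. With curl (the URL below is just a placeholder):

    # -I sends a HEAD request and prints only the response headers
    curl -I "https://example.com/some/file.pdf"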

gamebeaker commented 1 year ago

@Stamimail I don't really understand what your use case is. One "solution" for online books could be to use WebToEpub, a browser extension which scans the URL for chapter links; you can copy a list of all the chapter links from the extension. A "solution" with httrack:

-O "C:\temp\github\01" -r2 -A100 -%c1 -c1 -n -Z -%v -s0 -F "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.1) Gecko/20020826" "-*" in hts-log.txt you can extract the links with regex: /\(wizard\) explicit forbidden \(-\*\) link: .+/gm One limitation is that it only works if all your links are on one page or you need to use a URL list.

Stamimail commented 1 year ago

For example, I used WinHTTrack.exe (GUI) to copy this website to my PC. Result: ~100 MB, ~3300 folders, ~3700 files. This is the normal behaviour. Only after downloading to my PC can I get the website structure. See 1.csv, the file list exported to a CSV file.

Now, I don't want to download the whole website. I want (at first) to get the website structure. The output would be a list of files/links in a text file. See 2.txt and 3.txt to understand the concept (ignore what needs to be ignored). Result: a text file of less than 1 MB.

The user gets a preview of the website structure in a text file. Later on: with the list of files, the user will be able to import it into a file explorer and get a file-explorer view of the website structure; with the list of links, the user will be able to check links in a web browser (trial and error) and get a better sense of the website structure and where what he needs is located.

1.csv 2.txt 3.txt

gamebeaker commented 1 year ago

@Stamimail I think I found an alternative solution for you, but it still downloads all pages (not as HTML etc., but into its cache). The problem is that if you want the structure of a site, you have to download the pages so you can extract the links that point to other parts of the site, then follow the links from those pages, and so on.

Warning: the software could be a virus etc. Since it has been around since ancient times I don't think so, but be cautious; I used it in a VirtualBox. Xenu pretty much does what you want. After installing it: File -> Check URL... -> enter your URL "https://books.toscrape.com/" -> I disabled "Check external links" -> More options... -> I slowed it down to one thread and a maximum depth of 5 for my test. As it is blazing fast and I couldn't find a slow-down option, I would use a VPN/proxy and try to limit the Internet speed of the VM to prevent Cloudflare from banning you as a bot/DDoS attacker. Here you can see how to configure the proxy. After you have scanned the site, you can save the links as a tab-separated file (old CSV?): File -> Export to TAB separated file... If you need this more often, you can try to use the CLI of Xenu.
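If only the URLs are needed from that export, a rough post-processing sketch (assuming the exported file is called export.txt and the URL is in the first tab-separated column, which may not hold for every Xenu version):

    # Keep the first tab-separated column (the URL) and drop the header line, if any
    cut -f1 export.txt | tail -n +2 > urls.txt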

gamebeaker commented 1 year ago

@Stamimail here is the file I created using this program. As it is a site made for scraping, I didn't set limits; it took 3.5 min with 100 threads and it shows me 3214 links (my depth was 200). 100threads.txt

bogasamai commented 1 year ago

Are any of you able to compile the version on GitHub? I am getting errors:

    checking for inflateEnd in -lz... yes
    checking zlib.h usability... yes
    checking zlib.h presence... yes
    checking for zlib.h... yes
    checking for inflateEnd in -lz... (cached) yes
    checking zlib in /usr... ok
    checking whether to enable https support... yes
    checking for EVP_get_digestbyname in -lcrypto... no
    checking for SSL_CTX_new in -lssl... no
    configure: error: not available
    configure: WARNING: cache variable lt_cv_path_LD contains a newline
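The failing checks for EVP_get_digestbyname in -lcrypto and SSL_CTX_new in -lssl usually indicate that the OpenSSL development headers are not installed; on a Debian/Ubuntu system something along these lines typically resolves it (package names are an assumption and vary by distribution):

    # Install the OpenSSL (and zlib) development packages, then re-run configure
    sudo apt-get install libssl-dev zlib1g-dev
    ./configure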

gamebeaker commented 12 months ago

@Stamimail is the problem solved? If yes, please close this issue. If not, why not?

Stamimail commented 8 months ago

@xroche, will this feature be supported in httrack or not?