yacylover closed this issue 3 years ago
Have you tested this with local file importing and the same file?
I just did an additional test with local file importing. After selecting the local file and clicking the 'Import warc file' button, nothing happens, and there are no errors or exceptions in the log. With the remote webserver method, the import runs for a few seconds and shows the processing screen before it aborts, and 'chunked stream ended unexpectedly' appears in the log.
Ok, just pushed a fix: this was caused by single WARC entries with faulty protocol elements (i.e. superfluous empty lines), which caused the process to crash. I am now skipping those entries.
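For readers following along, here is a rough sketch of the "skip the faulty entry instead of crashing" idea. It is not the actual YaCy code (which is Java). It assumes the faulty element sits in an HTTP chunked body, and `decode_chunked` and `import_entries` are hypothetical names made up for illustration.

```python
import io


def decode_chunked(raw: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body; raise ValueError if it is malformed.

    Hypothetical helper for illustration, not part of YaCy.
    """
    stream = io.BytesIO(raw)
    out = bytearray()
    while True:
        size_line = stream.readline()
        if not size_line.endswith(b"\r\n"):
            raise ValueError("chunked stream ended unexpectedly")
        # Ignore optional chunk extensions after ';'.
        size_text = size_line.split(b";", 1)[0].strip()
        try:
            size = int(size_text.decode("ascii"), 16)
        except ValueError:
            # A superfluous empty line ends up here (empty size field).
            raise ValueError(f"bad chunk size line: {size_line!r}") from None
        if size == 0:
            return bytes(out)  # trailer headers, if any, are ignored
        chunk = stream.read(size)
        if len(chunk) != size or stream.read(2) != b"\r\n":
            raise ValueError("chunked stream ended unexpectedly")
        out += chunk


def import_entries(entries):
    """Decode every (url, raw_body) pair, skipping entries that fail."""
    imported, skipped = [], []
    for url, raw_body in entries:
        try:
            imported.append((url, decode_chunked(raw_body)))
        except ValueError:
            # One broken entry must not abort the whole import.
            skipped.append(url)
    return imported, skipped
```

The point of the design is that the error is caught per entry, so the loop continues with the next record and the importer can report which URLs were skipped.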
What tool did you use to create the WARC file?
Thank you very much! Typically I use Heritrix or Offline Explorer Enterprise at work. But some ISPs classify web scraping as a trojan :-( and I get warning mails from their abuse department. Unfortunately, this also happens when crawling with YaCy :-( For a few months now I've been using a VPN for better protection. The WARC file linked above was created with Webrecorder because some sites even detect and block the VPN :-( But imho some sites need to be preserved for our descendants. Webrecorder is good for these special cases where other tools don't work.
Hi,
Importing gzip-compressed WARC files works very well now. @Orbiter thank you very much for the fix. However, for some WARC archives only the first entry is processed. The log shows the following:
I 2020/12/26 17:35:46 WarcImporter chunked stream ended unexpectedly
I placed the WARC file on a local webserver and am importing it with the URL method.
To reproduce the error, I have uploaded the affected WARC file here.
A fix would be very helpful.
Best
LA_FORGE