Closed benoit74 closed 4 months ago
It looks like Brave removes the gclid param (Google ad tracking param) as an anti-tracking feature. It generates an artificial redirect to the URL without it, but because of the way it does this, the response was not recorded.
It was added to the pageinfo, however. I have a fix that will ensure this is written to the WARC.
@benoit74 for your use case though, you could just crawl: https://www.playmobil.com/fr-fr/tiny-house/71509.html
Seed URL: https://www.playmobil.com/fr-fr/tiny-house/71509.html?gad_source=1&gclid=CjwKCAjwuJ2xBhA3EiwAMVjkVK41oNKfKsuOcp6oXd4I1lLYXhgnB4PE3Yg8zSBMPb7jHvZEZbMdBRoCizIQAvD_BwE
Command:
Only one page (the seed) is crawled by this command (a bit of a weird site to crawl... but why not!).
Unfortunately, the WARC file contains no entry with a WARC-Target-URI matching the seed URL. Looking at pages.jsonl, I found the proper WARC-Page-ID, and I found that the WARC record has https://www.playmobil.com/fr-fr/tiny-house/71509.html?gad_source=1 as its WARC-Target-URI.
This causes us an issue because it means warc2zim cannot just process the WARC file and search for the entry whose WARC-Target-URI matches the seed URL. We could imagine processing pages.jsonl like I did manually, but this is not really convenient since it means we cannot rely only on the WARC file(s).
Is this behavior intentional or a bug / side-effect of something?
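For reference, the manual pages.jsonl lookup could be sketched roughly as below. This is only an illustration, not what warc2zim does: it assumes each page line in pages.jsonl is a JSON object with "url" and "id" keys, and that the "id" value is what appears as WARC-Page-ID. The function name find_page_id and the sample values are hypothetical.

```python
import json
from typing import Iterable, Optional


def find_page_id(pages_lines: Iterable[str], seed_url: str) -> Optional[str]:
    """Return the "id" of the first pages.jsonl entry whose "url" equals seed_url.

    Assumption: one JSON object per line, with page entries carrying
    "url" and "id" keys. Lines without a matching "url" are skipped.
    """
    for line in pages_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        if entry.get("url") == seed_url:
            return entry.get("id")
    return None


# Hypothetical sample content, made up for illustration only.
sample_lines = [
    '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}',
    '{"id": "example-page-id", "url": "https://example.com/page?x=1&gclid=abc"}',
]
page_id = find_page_id(sample_lines, "https://example.com/page?x=1&gclid=abc")
```

With a real crawl, the lines would come from opening pages.jsonl, and the returned id would then be compared against the WARC-Page-ID header of each record. This still requires a file other than the WARC itself, which is the inconvenience described above.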
Other than this problem, the WARC is perfect and the transformation to ZIM is fully OK if I hack the seed URL before calling warc2zim.
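One possible way to do that seed URL hack is sketched below, using only the Python standard library. It assumes that gclid is the only parameter being removed, as observed in this crawl; the helper name strip_tracking_params and the STRIPPED_PARAMS set are hypothetical and not part of warc2zim or the crawler.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only gclid is stripped, matching what was observed in this crawl.
STRIPPED_PARAMS = frozenset({"gclid"})


def strip_tracking_params(url: str, stripped=STRIPPED_PARAMS) -> str:
    """Return url with the query parameters named in `stripped` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in stripped
    ]
    # urlencode re-encodes the remaining values, so unusual escaping in the
    # original query string may not be preserved byte for byte.
    return urlunsplit(parts._replace(query=urlencode(kept)))


seed = (
    "https://www.playmobil.com/fr-fr/tiny-house/71509.html"
    "?gad_source=1&gclid=CjwKCAjwuJ2xBhA3EiwAMVjkVK41oNKfKsuOcp6oXd4I1lLYXhgnB4PE3Yg8zSBMPb7jHvZEZbMdBRoCizIQAvD_BwE"
)
cleaned_seed = strip_tracking_params(seed)
```

For the seed above, the result keeps gad_source=1 and drops gclid, which matches the WARC-Target-URI found in the record. The cleaned URL could then be passed to warc2zim in place of the original seed.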