webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Crawler is not returning full seed page URL in WARC `WARC-Target-URI` #627

Closed benoit74 closed 4 months ago

benoit74 commented 4 months ago

Seed URL: https://www.playmobil.com/fr-fr/tiny-house/71509.html?gad_source=1&gclid=CjwKCAjwuJ2xBhA3EiwAMVjkVK41oNKfKsuOcp6oXd4I1lLYXhgnB4PE3Yg8zSBMPb7jHvZEZbMdBRoCizIQAvD_BwE

Command:

docker run -v $PWD/output:/output --name crawlme --rm webrecorder/browsertrix-crawler:1.2.0 crawl --failOnFailedSeed --behaviors "autoplay,autofetch,autoscroll" --url "https://www.playmobil.com/fr-fr/tiny-house/71509.html?gad_source=1&gclid=CjwKCAjwuJ2xBhA3EiwAMVjkVK41oNKfKsuOcp6oXd4I1lLYXhgnB4PE3Yg8zSBMPb7jHvZEZbMdBRoCizIQAvD_BwE" --mobileDevice "Pixel 2" --cwd /output --combineWARC

This command crawls only one page, the seed (a slightly odd site to crawl, but why not!).

Unfortunately, the WARC file contains no record whose WARC-Target-URI matches the seed URL. Using pages.jsonl, I found the relevant WARC-Page-ID, and the corresponding WARC record has https://www.playmobil.com/fr-fr/tiny-house/71509.html?gad_source=1 as its WARC-Target-URI.

This is a problem for us because warc2zim cannot simply process the WARC file and look for the record whose WARC-Target-URI matches the seed URL. We could process pages.jsonl as I did manually, but that is inconvenient because it means we can no longer rely on the WARC file(s) alone.
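For illustration, the manual pages.jsonl lookup described above could be sketched roughly as follows. This is only a sketch under assumptions: it assumes each page line is a JSON object with `id` and `url` fields and that the `url` field holds the seed URL as given; the function name `find_page_id` and the sample data are hypothetical and not part of browsertrix-crawler or warc2zim.

```python
import json


def find_page_id(pages_jsonl_lines, seed_url):
    """Return the 'id' of the first entry whose 'url' equals seed_url, else None.

    Assumption: every non-empty line is a JSON object; lines without a
    'url' field (such as a header line) simply never match.
    """
    for line in pages_jsonl_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("url") == seed_url:
            return entry.get("id")
    return None


# Hypothetical sample content, not taken from a real crawl.
sample_lines = [
    '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}',
    '{"id": "example-page-id", "url": "https://example.com/page?gclid=abc", "title": "Example"}',
]

page_id = find_page_id(sample_lines, "https://example.com/page?gclid=abc")
```

The returned ID could then be compared against the WARC-Page-ID header of each record, which is exactly the extra step that relying on the WARC alone would avoid.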

Is this behavior intentional or a bug / side-effect of something?

Other than this problem, the WARC is perfect, and the conversion to ZIM works fine if I hack the seed URL before calling warc2zim.

ikreymer commented 4 months ago

It looks like Brave removes the gclid param (a Google ad tracking param) as an anti-tracking feature. It generates an artificial redirect to the URL without it, but because of how it does so, the response was not recorded. It was added to the pageinfo, however. I have a fix that will ensure this is written to the WARC.
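As a consumer-side illustration of the same normalization, a tool could drop the gclid param from the seed URL before comparing it with WARC-Target-URI. The sketch below uses only the Python standard library; the helper name `strip_query_params` is hypothetical, and only gclid is removed because it is the only param mentioned here (Brave's actual filter list is not assumed).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query_params(url, names=("gclid",)):
    """Return url with the named query parameters removed.

    Note: the remaining parameters are re-encoded with urlencode, so
    unusual percent-encoding in other URLs may not round-trip exactly.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


seed = (
    "https://www.playmobil.com/fr-fr/tiny-house/71509.html"
    "?gad_source=1&gclid=CjwKCAjwuJ2xBhA3EiwAMVjkVK41oNKfKsuOcp6oXd4I1lLYXhgnB4PE3Yg8zSBMPb7jHvZEZbMdBRoCizIQAvD_BwE"
)
normalized = strip_query_params(seed)
```

For the seed URL in this issue, the normalized value is the URL with only `gad_source=1` left in the query string, which is the WARC-Target-URI reported above.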

@benoit74 for your use case though, you could just crawl: https://www.playmobil.com/fr-fr/tiny-house/71509.html