wikimedia / html-metadata

MetaData html scraper and parser for Node.js (supports Promises and callback style)
MIT License

Fix HTTP 406 errors in scraping tests #95

Closed Jacobojijo closed 2 weeks ago

Jacobojijo commented 1 month ago

While running the test suite for the html-metadata project (npm test), I found that two tests in scraping.js were failing with HTTP 406 errors:

- The "nested Twitter data from www.theguardian.com" test in the parseTwitter function
- The "should return an object or array and get correct data" test for The Guardian URL in the parseJsonLd function

These errors were causing the test suite to fail:

1) scraping parseTwitter function nested Twitter data from www.theguardian.com: HTTPError: 406: http_error

2) scraping parseJsonLd function https://www.theguardian.com/commentisfree/2024/mar/08/the-guardian-view-on-wikipedias-female-volunteers-a-hive-heroism-that-changes-history should return an object or array and get correct data: HTTPError: 406: http_error

Problem: The issue appears to be related to the User-Agent and Accept headers being sent with the HTTP requests. Some websites, including The Guardian, seem to be rejecting requests with the default headers used by the preq library.
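
A minimal sketch of the kind of change this points to, assuming the request options are a plain object with `uri` and `headers` fields (the shape preq is generally called with). The helper name `withDefaultHeaders` and the header values are hypothetical and are not taken from the project:

```javascript
'use strict';

// Hypothetical defaults; the exact values are assumptions chosen for illustration.
const DEFAULT_HEADERS = {
  'User-Agent': 'html-metadata-tests/1.0',
  Accept: 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'
};

// Return a copy of the request options with User-Agent and Accept filled in,
// leaving any header the caller already supplied untouched.
function withDefaultHeaders(options) {
  // Accept either a bare URL string or an options object.
  const opts = typeof options === 'string' ? { uri: options } : Object.assign({}, options);
  const supplied = opts.headers || {};

  // Header names are case-insensitive, so compare them in lower case.
  const present = new Set(Object.keys(supplied).map((name) => name.toLowerCase()));

  const headers = Object.assign({}, supplied);
  for (const [name, value] of Object.entries(DEFAULT_HEADERS)) {
    if (!present.has(name.toLowerCase())) {
      headers[name] = value;
    }
  }

  opts.headers = headers;
  return opts;
}

// Build options for the Guardian URL used in the failing parseJsonLd test.
const guardianOptions = withDefaultHeaders(
  'https://www.theguardian.com/commentisfree/2024/mar/08/the-guardian-view-on-wikipedias-female-volunteers-a-hive-heroism-that-changes-history'
);
console.log(guardianOptions.headers);
```

The resulting object could then be passed to the HTTP client in the tests in place of the bare URL. Whether these particular header values are enough to avoid the 406 from The Guardian is an assumption that would need to be checked against the live site.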