updating scraping.js to fix HTTP 406 error while running the test suite

Jacobojijo commented 1 month ago

This is to fix issue #95

Solution

I modified the `scraping.js` file to send more browser-like `User-Agent` and `Accept` headers. Here are the key changes:

* Added constants for the user agent and accept header:

```javascript
const userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36';
const acceptHeader = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8';
```

* Created a `getWithHeaders` function to make requests with these headers:

```javascript
function getWithHeaders(url) {
    return preq.get({
        uri: url,
        headers: {
            'User-Agent': userAgent,
            'Accept': acceptHeader
        }
    });
}
```

* Updated all `preq.get()` calls to use `getWithHeaders()` instead.
* Modified the `meta()` function calls to include the headers:

```javascript
return meta({
    uri: url,
    headers: {
        'User-Agent': userAgent,
        'Accept': acceptHeader
    }
})
```
This should improve the reliability of the test suite, especially when dealing with websites that have stricter requirements for incoming requests.
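
Since the same headers object appears in both `getWithHeaders()` and the `meta()` calls, one possible follow-up is to build it in a single place. The sketch below is not part of this PR: `withBrowserHeaders` is a hypothetical helper name, and it assumes `preq.get()` and `meta()` both accept the `{ uri, headers }` options shape used above.

```javascript
// Hypothetical helper (not in this PR): builds the request options once so
// preq.get() and meta() calls can share the same browser-like headers.
const userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36';
const acceptHeader = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8';

function withBrowserHeaders(urlOrOptions) {
    // Accept either a plain URL string or an existing options object.
    const options = typeof urlOrOptions === 'string' ?
        { uri: urlOrOptions } :
        Object.assign({}, urlOrOptions);
    // Defaults come first, so headers supplied by the caller override them.
    options.headers = Object.assign({
        'User-Agent': userAgent,
        'Accept': acceptHeader
    }, options.headers);
    return options;
}

// Assumed usage, mirroring the calls shown above:
//   preq.get(withBrowserHeaders(url));
//   meta(withBrowserHeaders(url));
```

Header names are matched case-sensitively here, so a caller passing `accept` in lowercase would add a second key rather than override the default.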

Jacobojijo commented 1 month ago

@mvolz, any PR review on this?

Jacobojijo commented 1 month ago

@mvolz, I realized the issue was that package-lock.json was out of sync with package.json: package.json had been updated, but the change was not reflected in package-lock.json. I have fixed this, and you can now test the PR.

Jacobojijo commented 1 month ago

@mvolz, PR merge?

mvolz commented 2 weeks ago

Thanks!

wikimedia / html-metadata

updating scraping.js to fix HTTP 406 error while running the test suite #99
