wikimedia / html-metadata

MetaData html scraper and parser for Node.js (supports Promises and callback style)
MIT License

Updating scraping.js to fix HTTP 406 error while running the test suite #99

Closed Jacobojijo closed 2 weeks ago

Jacobojijo commented 1 month ago

This is to fix issue #95

Solution

I modified the scraping.js file to use more browser-like User-Agent and Accept headers. Here are the key changes:

  1. Added constants for the User-Agent and Accept headers:

    const userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36';
    const acceptHeader = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8';

  2. Created a `getWithHeaders` function to make requests with these headers:

    ```javascript
    function getWithHeaders(url) {
        return preq.get({
            uri: url,
            headers: {
                'User-Agent': userAgent,
                'Accept': acceptHeader
            }
        });
    }
    ```

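If a caller ever needs to override one of these defaults, the header merge could look roughly like the sketch below. This is only an illustration: `DEFAULT_HEADERS` and `buildRequestOptions` are hypothetical names that do not exist in scraping.js or preq, and the sketch only builds the options object (the same `{ uri, headers }` shape passed to `preq.get` above) without sending a request.

```javascript
// Illustrative sketch only: DEFAULT_HEADERS and buildRequestOptions are
// hypothetical names, not part of html-metadata or preq.
const DEFAULT_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
};

// Build an options object with the browser-like defaults applied.
// Header names are compared case-insensitively, so a caller-supplied
// 'accept' replaces the default 'Accept' instead of duplicating it.
function buildRequestOptions(url, overrides = {}) {
    const headers = Object.assign({}, DEFAULT_HEADERS);
    for (const [name, value] of Object.entries(overrides)) {
        const existing = Object.keys(headers).find(
            (key) => key.toLowerCase() === name.toLowerCase()
        );
        if (existing) {
            delete headers[existing];
        }
        headers[name] = value;
    }
    return { uri: url, headers };
}

// Example: keep the default User-Agent but request HTML only.
const options = buildRequestOptions('https://example.org/', { accept: 'text/html' });
console.log(options.headers);
```

The returned object could then be handed to a request function in place of the inline literal, which keeps the default headers in one place.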
Jacobojijo commented 1 month ago

@mvolz, could you review this PR?

Jacobojijo commented 1 month ago

@mvolz, I realized the issue was that package.json was out of sync with package-lock.json: an update to package.json had not been reflected in package-lock.json. I have fixed this, and you can now test the PR.
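
For anyone hitting the same problem, this kind of drift can be spotted with a small check like the sketch below. It is only an illustration, not code from this PR: `findLockfileDrift` is a hypothetical helper, and it assumes a lockfile with `lockfileVersion` 2 or 3, where package-lock.json mirrors the root manifest's dependency ranges under `packages[""]`. With an older lockfile that has no `packages` section, every dependency would be reported as drift.

```javascript
// Illustrative sketch only: findLockfileDrift is a hypothetical helper.
// Assumes lockfileVersion 2 or 3, where the lockfile records the root
// project's dependency ranges under packages[""].
function findLockfileDrift(pkg, lock) {
    const root = (lock.packages && lock.packages['']) || {};
    const drift = [];
    for (const field of ['dependencies', 'devDependencies']) {
        const wanted = pkg[field] || {};
        const locked = root[field] || {};
        // Check every name that appears on either side.
        const names = new Set([...Object.keys(wanted), ...Object.keys(locked)]);
        for (const name of names) {
            if (wanted[name] !== locked[name]) {
                drift.push({ field, name, manifest: wanted[name], lockfile: locked[name] });
            }
        }
    }
    return drift;
}

// Example with inline data; in a real checkout the two objects would come
// from JSON.parse(fs.readFileSync(...)) on package.json and package-lock.json.
const pkg = { devDependencies: { mocha: '^10.0.0' } };
const lock = { lockfileVersion: 2, packages: { '': { devDependencies: { mocha: '^9.0.0' } } } };
console.log(findLockfileDrift(pkg, lock));
```

An empty array means the ranges agree; regenerating the lockfile with `npm install` is the usual way to bring the two files back in sync.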

Jacobojijo commented 1 month ago

@mvolz, is this PR ready to merge?

mvolz commented 2 weeks ago

Thanks!