wikimedia / html-metadata

MetaData html scraper and parser for Node.js (supports Promises and callback style)
MIT License
138 stars 44 forks source link

Exposed loadHTML method #63

Closed kolarski closed 6 years ago

kolarski commented 6 years ago

In my use case I've already downloaded the HTML files and needed to parse them locally.

So I exposed a function loadHTML that takes the html as string and then internally converts it into cheerio object.

Here is usage example:

const html = fs.readFileSync('./page.html', 'utf-8');
scrape.loadHTML(html).then(function(metadata){
    console.log(JSON.stringify(metadata, null, 4));
});
kolarski commented 6 years ago

Fixed the unit tests - They pass successfully for Node v4 and v6, but still fail for 0.10 and 0.12 because of ES6 dependency. May be we should turn off testing of 0.10 and 0.12 ?

Added fallback if not preset <html lang=""> to try and get <html xml:lang="">

d00rman commented 6 years ago

Thank you for your contribution @kolarski ! Since your addition enables loading the HTML from files and strings, it would be good to actually export two functions: loadFromFile() and loadFromString(). That way, in your case (and similar ones) you could load the file and convert it in one go. Could you add/rename the functions?

d00rman commented 6 years ago

May be we should turn off testing of 0.10 and 0.12 ?

Fixed in #64. Please rebase this PR on top of master.

kolarski commented 6 years ago

Thanks for the suggestions! I've added the two methods and two unit tests along with them.

kolarski commented 6 years ago

Nice suggestions - updated the PR with promisifyed fs and options object to pass to fs.readFile.

Tested for these example usages: meta.loadFromFile('page.html').then(..) meta.loadFromFile('page.html', {encoding: 'utf-8'}).then(..) meta.loadFromFile('page.html', 'utf-8').then(..) meta.loadFromFile('page.html', {encoding: 'utf-8'}, (err, results) => {..}) meta.loadFromFile('page.html', (err, results) => {..})

How does it look to you ?

d00rman commented 6 years ago

LGTM. @mvolz thoughts?

mvolz commented 6 years ago

Looks good to me, merging. Thanks for your hard work on this!