wikimedia / html-metadata

MetaData html scraper and parser for Node.js (supports Promises and callback style)
MIT License

JSON-LD content embedded in CDATA not parsed #79

Open kael opened 6 years ago

kael commented 6 years ago

Parsing a Recipe described in JSON-LD content that is wrapped in a CDATA section returns nothing.

The sample:

<script type="application/ld+json">
    //<![CDATA[
    {"@context":"http://schema.org","@type":"Recipe","name":"Ile flottante","recipeCategory":"\u00eele flottante","image":"https://image.afcdn.com/recipe/20130408/34776_w1024h768c1cx256cy192.jpg","datePublished":"2003-03-31T07:21:00+02:00","prepTime":"PT15M","cookTime":"PT30M","totalTime":"PT45M","recipeYield":"4 personnes","recipeIngredient":["60 cl lait","60 g sucre en poudre","1 gousse vanille","5 oeuf","1 pinc\u00e9e sel","130 g sucre glace","60 g amande"],"recipeInstructions":[{"@type":"HowToStep","text":"Casser le \u0153ufs en s\u00e9parant les blancs des jaunes."},{"@type":"HowToStep","text":"Monter les blancs en neige en y ajoutant une pinc\u00e9e de sel. Mettre petit \u00e0 petit le sucre glace."},{"@type":"HowToStep","text":"Mettre le m\u00e9lange dans un moule \u00e0 charlotte recouvert d'aluminium et beurr\u00e9."},{"@type":"HowToStep","text":"Cuire au bain-marie dans un four \u00e0 210\u00b0C (thermostat 7) pendant 25 \u00e0 30 minutes."},{"@type":"HowToStep","text":"Faire une cr\u00e8me anglaise en chauffant le lait avec une gousse de vanille. Battre les jaunes d\u2019\u0153ufs et le sucre pour les faire mousser. Ajouter petit \u00e0 petit le lait chaud \u00e0 la vanille."},{"@type":"HowToStep","text":"\u00c9paissir le m\u00e9lange au bain-marie et arr\u00eater lorsque la cr\u00e8me nappe la cuill\u00e8re."},{"@type":"HowToStep","text":"D\u00e9moulez l'\u00eele. Saupoudrez le dessus d'amandes effil\u00e9es."},{"@type":"HowToStep","text":"Versez la cr\u00e8me anglaise tout autour de l'\u00eele et mettez au r\u00e9frig\u00e9rateur jusqu'au moment de servir."}],"author":"Sinfonia","description":"lait, sucre en poudre, vanille, oeuf, sel, sucre glace, amande","keywords":"Ile flottante, \u00eele flottante, lait, sucre en poudre, vanille, oeuf, sel, sucre glace, amande","aggregateRating":{"@type":"AggregateRating","reviewCount":44,"ratingValue":4.3,"worstRating":0,"bestRating":5}}
    //]]>
</script>

It seems Cheerio doesn't handle that case. I'm using this quick fix for the parsing:


- contents = JSON.parse(this.children[0].data);
+ contents = JSON.parse(this.children[0].data.replace(/\n    \/\//g, '').replace(/\n/g, '').replace(/<!\[CDATA\[(.*?)]]>/, '$1').trim());

There's certainly a better way to clean the content.
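
One possible cleaner approach, as a rough sketch rather than a tested patch: strip only the CDATA wrapper at the start and end of the script text, and leave everything in between untouched. That way the fix does not depend on the exact indentation or on removing every newline. The helper names `stripCdata` and `parseJsonLd` are hypothetical and not part of html-metadata or Cheerio. The sketch assumes the wrapper is a single `<![CDATA[` ... `]]>` pair, optionally hidden behind `//` or `/* */` comments.

```javascript
// Hypothetical helper (not part of html-metadata): removes a CDATA wrapper,
// optionally commented out with // or /* */, from a script element's text.
function stripCdata(raw) {
  return raw
    // Opening marker: optional "//" or "/*", then "<![CDATA[", optional "*/".
    .replace(/^\s*(?:\/\/|\/\*)?\s*<!\[CDATA\[\s*(?:\*\/)?/, '')
    // Closing marker: optional "//" or "/*", then "]]>", optional "*/".
    .replace(/(?:\/\/|\/\*)?\s*\]\]>\s*(?:\*\/)?\s*$/, '')
    .trim();
}

// Hypothetical wrapper: clean the text, then parse it as JSON.
function parseJsonLd(raw) {
  return JSON.parse(stripCdata(raw));
}

// Shortened version of the sample from this issue.
const sample = `
    //<![CDATA[
    {"@context":"http://schema.org","@type":"Recipe","name":"Ile flottante"}
    //]]>
`;

const recipe = parseJsonLd(sample);
console.log(recipe.name); // prints "Ile flottante"
```

Applied to the line from the quick fix, this would presumably become `contents = JSON.parse(stripCdata(this.children[0].data));`. Script text without a CDATA wrapper passes through unchanged apart from trimming.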