xmppo / node-expat

libexpat XML SAX parser binding for node.js
https://github.com/xmppo/node-expat
MIT License

Parser sends many text events for the same element #182

Closed: javiercbk closed this issue 6 years ago

javiercbk commented 6 years ago

The following snippet produces unexpected output:

const Promise = require('bluebird');
const { Readable } = require('stream');
const expat = require('node-expat');

const parser = new expat.Parser('UTF-8');

new Promise((resolve, reject) => {
  parser.on('text', (text) => {
    console.log('Text event emmited:', text);
  });

  parser.on('end', () => {
    resolve();
  });

  parser.on('error', (error) => {
    reject(error);
  });

  const s = new Readable();
  s._read = function () {};
  s.pipe(parser);
  s.push(`<?xml version="1.0" encoding="UTF-8"?>
  <test>
    <inner>Expected the text event to push all text at once&lt;sup>&#174;&lt;/sup></inner>
  </test>`);
  s.push(null);
})
.then(() => {
  console.log('success');
  process.exit(0); // exit code 0 on success
})
.catch((err) => {
  console.log('error', err);
  process.exit(1);
});

Output produced:

Text event emmited: 

Text event emmited:     
Text event emmited: Expected the text event to push all text at once
Text event emmited: <
Text event emmited: sup>
Text event emmited: ®
Text event emmited: <
Text event emmited: /sup>
Text event emmited: 

Text event emmited:   
success

A workaround is to check whether an endElement event was emitted since the previous text event; if not, the new text belongs to the same element.

Is this behaviour expected?

Thanks

javiercbk commented 6 years ago

The Wikipedia entry for the SAX parser API states that:

the SAX specification deliberately states that a given section of text may be reported as multiple sequential text events. Many parsers, for example, return separate text events for numeric character references

So this behaviour is expected and perfectly fine.
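
Since several text events per element are allowed, one way to get the text in a single piece is to buffer the chunks and join them at the next element boundary. The following is only a minimal sketch of that idea: `collectText` and `onText` are hypothetical names that are not part of node-expat, and a plain EventEmitter stands in for the parser so the snippet runs without the native module. It assumes the parser emits `startElement`, `text` and `endElement` events, as node-expat does.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper (not part of node-expat): joins consecutive 'text'
// events into one string and passes it to `onText` at the next element
// boundary (start or end tag).
function collectText(parser, onText) {
  let chunks = [];

  const flush = () => {
    if (chunks.length === 0) return;
    onText(chunks.join(''));
    chunks = [];
  };

  parser.on('text', (text) => chunks.push(text));
  parser.on('startElement', flush);
  parser.on('endElement', flush);
}

// Stand-in emitter replaying the events reported for <inner> in this issue.
const fakeParser = new EventEmitter();
const collected = [];
collectText(fakeParser, (text) => collected.push(text));

fakeParser.emit('startElement', 'inner', {});
['Expected the text event to push all text at once', '<', 'sup>', '®', '<', '/sup>']
  .forEach((chunk) => fakeParser.emit('text', chunk));
fakeParser.emit('endElement', 'inner');

console.log(collected);
```

With a real parser, the same helper would presumably be attached as `collectText(parser, handler)` before piping the stream. Note that whitespace-only text between elements is also reported, so a handler may want to skip chunks that are empty after trimming.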