xmppo / node-expat

libexpat XML SAX parser binding for node.js
https://github.com/xmppo/node-expat
MIT License
385 stars 97 forks source link

Aborted (core dumped): Trying to parse the same XML twice #102

Open marcellodesales opened 9 years ago

marcellodesales commented 9 years ago

Hi there,

Thanks for making this library... Super clean and fast!

So, I'm building a parser for pom.xml files... I built the initial implementation of it, but my first example is failing... :(

var ext = require("../src");

var pom = ext.parsePom({ filePath: __dirname + "/pom.xml"});

console.log(pom);

var pom2 = ext.parsePom({ filePath: __dirname + "/pom.xml", format: "text"});

console.log(pom2);

I can load the xml once, but not twice at the same time... and I believe I have called "parser.destroy()" (https://github.com/marcellodesales/node-pom-parser/blob/master/src/index.js#L77)... The second call always core dumps...

$ node test/
{ timers: { load: '0.168683ms', parse: '1.141196ms' },
  artifactId: 'tynamo-parent',
  groupId: 'org.tynamo',
  version: '0.0.9' }
node: ../node-expat.cc:128: bool Parser::parseString(v8::String&, int): Assertion `buf != __null' failed.
Aborted (core dumped)

Is there anything I should do? Here's the travis execution that's failing https://travis-ci.org/marcellodesales/node-pom-parser

Thanks a lot! Marcello

lloydwatkin commented 9 years ago

I'd suggest looking at the implementation from ltx (https://github.com/node-xmpp/ltx/blob/af1bb571bea24f4c2451dc0f2d41ae37858023b3/lib/sax/sax_expat.js) or alternatively just use ltx directly?

var ltx = require('ltx')
var pom = ltx.parse(pomFileContents)
marcellodesales commented 9 years ago

@lloydwatkin I'd rather have an answer to address the problem other than an alternative solution. Any library should be idempotent... Could you at least confirm the library is broken?

I chose event-based because I don't need to load the entire xml and I'd like to be able to stop reading the stream when I'm done... If var pom = ltx.parse(pomFileContents) means parsing the entire xml document, then that does not fulfill my requirement...

Anyway, I will look for alternatives...

thanks!

marcellodesales commented 9 years ago

Although SAX works, it is a lot slower...

$ node test
{ timers: { load: '0.066483ms', parse: '4.527045ms' },
  groupId: 'org.tynamo.examples',
  artifactId: 'tynamo-example-federatedaccounts',
  version: '0.0.1-SNAPSHOT' }
org.tynamo.different:tynamo-example-super:1.4.5-SNAPSHOT

I will keep this as an alternative... However, I would love to be able to use EXPAT directly....

lloydwatkin commented 9 years ago

It looks like @astro never completed that functionality https://github.com/node-xmpp/node-expat/blob/master/test.js#L332

I've added the test back in and added a reset() call between each call and this works. Could you try that?

marcellodesales commented 9 years ago

Humm... A little bit better... The second document is not parsed... https://github.com/marcellodesales/node-pom-parser/blob/master/test/index.js

The undefined parsing is from a completely different pom2.xml file as shown above...

$ node test
{ artifactId: 'tynamo-parent',
  groupId: 'org.tynamo',
  version: '0.0.9',
  timers: { parse: '0.736577ms', load: '0.045919ms' } }
undefined:undefined:undefined

Here's what I'm trying with this latest commit I pushed... https://github.com/marcellodesales/node-pom-parser/blob/master/lib/expat_parser.js#L46

mdesales@ubuntu [11/24/2014 6:10:15] ~/dev/github/node-pom-parser (master *) $ git diff
diff --git a/lib/expat_parser.js b/lib/expat_parser.js
index efc5f5d..8d60079 100644
--- a/lib/expat_parser.js
+++ b/lib/expat_parser.js
@@ -43,7 +43,7 @@ module.exports.parse = function(xmlContent) {
           parser.on("text", function(text) {
             pom.version = pom.version || text;
             // Done here! So quit fast!
-            parser.destroy();
+            parser.reset();
           });
         }
       });
diff --git a/src/index.js b/src/index.js
index 605ac1f..b3c6023 100644
--- a/src/index.js
+++ b/src/index.js
@@ -56,8 +56,8 @@ module.exports.parsePom = function(opt) {

   // parse the pom, erasing all
   var parsePomStart = process.hrtime();
-  pom = sax.parse(pom.xmlContent);
-  //pom = expat.parse(pom.xmlContent); // EXTREMELY FASTER, BUT BROKEN!
+  //pom = sax.parse(pom.xmlContent);
+  pom = expat.parse(pom.xmlContent); // EXTREMELY FASTER, BUT BROKEN!

   pom.timers = {};
   var parsePomTime = process.hrtime(parsePomStart);

If I reset() outside the event loop, it works, but it has the same result of loading using SAX...

  // Parse, blocking until it ends.
  parser.parse(xmlContent);
  parser.reset();
  return pom;

The result is as slow as SAX as shown below in the parse timer because it goes through the entire doc in memory.

 $ node test/
{ groupId: 'org.tynamo.examples',
  artifactId: 'tynamo-example-federatedaccounts',
  version: '0.0.1-SNAPSHOT',
  timers: { parse: '2.605466ms', load: '0.048994ms' } }
org.tynamo.different:tynamo-example-super:1.4.5-SNAPSHOT
lloydwatkin commented 9 years ago

I'm going to have to hand off to @astro at this point. From the error I can see the buffer isn't supposed to be null, but my bindings skills aren't great.

marcellodesales commented 9 years ago

@lloydwatkin, sounds good! Thanks a lot for helping!