willemdj / erlsom

XML parser for Erlang
GNU Lesser General Public License v3.0

Parsing large PCDATA #57

Closed: dfrese closed this issue 8 years ago

dfrese commented 8 years ago

Hello,

Parsing an XML file with large PCDATA content (~100 MB, a base64-encoded string) with erlsom:scan seems to need about 10 GB of memory (100 times as much!), no matter whether I pass it a string (char list) or a binary. Is that expected behaviour, or am I using it the wrong way?

Thanks, David.

willemdj commented 8 years ago

Hello David,

Did you try the 'continuation_function' and/or the 'output_encoding' options?

Regards, Willem.

dfrese commented 8 years ago

I have now tried output_encoding "utf8" and see no effect on memory consumption. What kind of continuation_function should I pass to reduce the memory needed? I suspect it has something to do with the character lists used internally while parsing?

willemdj commented 8 years ago

No, with the continuation function you can pass the XML to the parser in several parts, so it is not necessary to have everything in memory. See the reference documentation.
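
For readers unfamiliar with the option, here is a minimal sketch of feeding a file to the parser in chunks. It is written from memory of the erlsom reference documentation and its continuation example, so treat the option shape `{continuation_function, Fun, State}` and the callback contract (`Fun(Tail, State)` returns `{NewTail, NewState}`) as assumptions to verify there. The module name, `count_events/2` and `next_chunk/2` are hypothetical. `erlsom:parse_sax/4` is used only because it needs no model; the suggestion in this thread is to pass the same option to `erlsom:scan`.

```erlang
-module(chunked_parse_sketch).
-export([count_events/2]).

%% Hypothetical helper: count the SAX events in the file at Path, reading
%% ChunkSize bytes at a time instead of loading the whole file up front.
count_events(Path, ChunkSize) ->
    {ok, Handle} = file:open(Path, [read, raw, binary]),
    try
        %% Assumed option shape: {continuation_function, Fun, InitialState}.
        Options = [{continuation_function, fun next_chunk/2, {Handle, ChunkSize}}],
        %% Start with an empty binary; the parser calls next_chunk/2 for more.
        {ok, Count, _Rest} =
            erlsom:parse_sax(<<>>, 0, fun(_Event, Acc) -> Acc + 1 end, Options),
        Count
    after
        file:close(Handle)
    end.

%% Assumed contract: receives the not-yet-parsed tail and the state, and
%% returns the tail extended with the next chunk, plus the next state.
next_chunk(Tail, {Handle, ChunkSize} = State) ->
    case file:read(Handle, ChunkSize) of
        {ok, Data} -> {<<Tail/binary, Data/binary>>, State};
        eof -> {Tail, State}
    end.
```

Chunked input only limits how much raw XML is held at once. It does not by itself change how the character data of a single element is represented, which is what the output_encoding option controls.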

dfrese commented 8 years ago

Ah, sorry for not trying that first! But it does not help; it just takes a bit longer before memory use goes up (it may even use more memory). Should that work even within the PCDATA content of a single element?

willemdj commented 8 years ago

Unfortunately I am not able to really investigate at this moment, sorry.

dfrese commented 8 years ago

Think I got it; my fault. I passed a string as the output_encoding, not an atom. Thanks for the quick responses and help.

(Maybe an error could be thrown for invalid output_encodings instead of falling back to the list default... like with the encoding option, which must be a string, not an atom ;-))
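
To make the mistake concrete, the sketch below contrasts the two option values and shows one way a caller could reject the string form before calling erlsom. The module and both function names are hypothetical and not part of erlsom. The guard only checks that the value is an atom; it does not know which atoms erlsom actually supports.

```erlang
-module(output_encoding_sketch).
-export([good_options/0, checked_options/1]).

%% The value is the atom utf8. Per this thread, this is the form that
%% takes effect.
good_options() ->
    [{output_encoding, utf8}].

%% Hypothetical guard: fail loudly when output_encoding is not an atom,
%% for example the string "utf8", which per this thread is silently
%% ignored in favour of the list default.
checked_options(Options) ->
    case lists:keyfind(output_encoding, 1, Options) of
        false -> Options;
        {output_encoding, Enc} when is_atom(Enc) -> Options;
        {output_encoding, Other} -> erlang:error({invalid_output_encoding, Other})
    end.
```

With this guard, `checked_options([{output_encoding, "utf8"}])` raises an error instead of letting the parser fall back to character lists, while `checked_options(good_options())` returns the options unchanged.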

willemdj commented 8 years ago

Okay, thanks for letting me know.
