willemdj / erlsom

XML parser for Erlang
GNU Lesser General Public License v3.0
265 stars 103 forks source link

xmlString/1 escape speedup #50

Closed zuckschwerdt closed 8 years ago

zuckschwerdt commented 8 years ago

Binaries are better suited to crunch big strings. This introduces escapeBinary/1 alongside escapeChar/1. Also escapeBinary/1 is called conditionally, which will result in a massive speedup for e.g. large chunks of Base64 encoded data. (You might want to remove the comments, especially the one about "not the fastest way to do this";) )

zuckschwerdt commented 8 years ago

This breaks existing calls to xmlString that expect a list result. Sorry, I'll rewrite it.

willemdj commented 8 years ago

erlsom:write was a bit of an afterthought... It is certainly not very efficient.

If you are looking for speed it is probably not difficult to add a significantly faster variant that returns a binary or an iolist. Then your suggestions on how to speed up xmlString would be quite valuable.

zuckschwerdt commented 8 years ago

I already looked into that and did some tests. Do you consider erlsom_write:write part of the stable API? I made a small change that effects flatten calls as late as possible and also let's me get the raw iolist. Using an iolist result I get a 40x speedup on that call and an overall throughput of 11MB/s instead of 3MB/s. See https://github.com/frobese/erlsom/commit/9952633c0a05cd2d6aca6096294c1faf33a19d57 -- it's WIP and I really like to hear your thoughts on that.

willemdj commented 8 years ago

Hi,

I would say that erlsom:write() is part of the API.

My suggestion would be to create a new function that does what you have been proposing: replace A ++ B by [A, B] everywhere, don't convert binary values to lists (but do check that they don't contain 'special' characters, and replace them if they do). The result is an iolist, which is fine in most cases. The new function could be called erlsom:write_iolist(), erlsom:io_write() or whatever. erlsom:write() remains for whoever wants to use it, but it simply calls the new function and converts the result to a list of unicode characters. With some luck the performance will be the same or even better.

regards, Willem

On Fri, Dec 11, 2015 at 9:21 AM, Christian W. Zuckschwerdt < notifications@github.com> wrote:

I already looked into that and did some tests. Do you consider erlsom_write:write part of the stable API? I made a small change that effects flatten calls as late as possible and also let's me get the raw iolist. Using an iolist result I get a 40x speedup on that call and an overall throughput of 11MB/s instead of 3MB/s. See frobese@9952633 https://github.com/frobese/erlsom/commit/9952633c0a05cd2d6aca6096294c1faf33a19d57 -- it's WIP and I really like to hear your thoughts on that.

— Reply to this email directly or view it on GitHub https://github.com/willemdj/erlsom/pull/50#issuecomment-163872547.

zuckschwerdt commented 8 years ago

Can you try this patch again? With latest master this works for all my tests cases. A quick check on memory consumption:

Xsd = "<?xml version='1.0' encoding='UTF-8'?>
       <xs:schema
           xmlns:xs='http://www.w3.org/2001/XMLSchema'
           elementFormDefault='qualified' attributeFormDefault='unqualified'>
           <xs:element name='items'>
               <xs:complexType>
                   <xs:sequence>
                       <xs:element name='item' type='xs:string' />
                   </xs:sequence>
               </xs:complexType>
           </xs:element>
       </xs:schema>",
{ok, Model} = erlsom:compile(Xsd),
Data = binary:copy(<<"0123456789">>, 100000), % 1 MB
Doc = {items, [], Data},
erlang:display(erlang:memory()),
Xml = erlsom:write(Doc, Model),
erlang:display(erlang:memory()).

Comparing the memory stats gives a total consumption of ~60x data size. That is really bad -- encoding a 100MB data item would need 6GB memory...

willemdj commented 8 years ago

The patch changes the behavior of erlsom:write/2. According to the documentation it should return a list of integers (unicode code points), but with the patch the return value can also contain binaries.

But I see your point, so I'll make some additional changes to fit this in.

zuckschwerdt commented 8 years ago

Sorry. Yes, I thought about using unicode:characters_to_list in erlsom:write/2, I guess lists:flatten won't do. I got another small patch lined up that will improve speed. See https://github.com/frobese/erlsom/commit/e7b0438a61db64e3c7d7ecf834f7ef81b5999e2e in https://github.com/frobese/erlsom/tree/feat-late-flatten