willemdj / erlsom

XML parser for Erlang
GNU Lesser General Public License v3.0
264 stars 103 forks source link

Transform the parser to CPS. Support pull parsing #51

Open ppolv opened 8 years ago

ppolv commented 8 years ago

I wonder if you will be willing to merge something in the lines of this patch. (In such case this will need more tests and performance comparison)

The basic idea is:

For pull parsing, I mean:

3> {continue, C1, E1} = erlsom:parse_pull(<<"<s:xstream xmlns:s='http://some-namespace.com' as='234' version='1.0'>">>, [{output_encoding, utf8}]).
{continue,#Fun<erlsom_sax_lib.4.36439255>,
          [startDocument,
           {startPrefixMapping,"s","http://some-namespace.com"},
           {startElement,"http://some-namespace.com","xstream","s",
                         [{attribute,"version",[],[],<<"1.0">>},
                          {attribute,"as",[],[],<<"234">>}]}]}
4> {continue, C2, E2} = C1(<<"<a href='ab">>).
{continue,#Fun<erlsom_sax_lib.0.36439255>,[]}
5> {continue, C3, E3} = C2(<<"c'><b>ss</b></a>">>).
{continue,#Fun<erlsom_sax_lib.4.36439255>,
          [{startElement,[],"a",[],
                         [{attribute,"href",[],[],<<"abc">>}]},
           {startElement,[],"b",[],[]},
           {characters,<<"ss">>},
           {endElement,[],"b",[]},
           {endElement,[],"a",[]}]}
6> {ok, E4, _Rest} = C3(<<"</s:xstream> tail data">>).   
{ok,[{endElement,"http://some-namespace.com","xstream","s"},
     {endPrefixMapping,"s"},
     endDocument],
    " tail data"}

That is, allow the user of the parser to pass the data in chunks to the parser, instead of the parser asking for more in a callback.

This allow simpler integration for long running parsing (think in infinite xml streams coming from network), and also allows hibernating the parsing process, that on current version is not always possible (as hibernate discard the stack).

Returning the list of events instead of using sax callbacks is mostly a matter of preference, sax callbacks can still be used. I found returning a list easier to work with (and size of the list is bounded if you feed the parser in chunks).

The change was to make every possible "blocking" parsing operation be in continuation passing style. Then the continueFun/* in erlsom_sax_lib check if we are in a pull parsing or not. If we are a pull parser, instead of asking the user for more data, it just return a function from where caller can continue parsing:

continueFun2K(T, V1, V2, V3, State = #erlsom_sax_state{continuation_fun=undefined}, ParseFun, K) ->
    {continue, fun(Data) -> ParseFun(<<T/binary, Data/binary>>, V1, V2,
    V3, State#erlsom_sax_state{user_state = []}, K) end,
         lists:reverse(State#erlsom_sax_state.user_state)};

(when pull parsing, the sax callbacks just accumulate the events in the user_state, so here we return them on the right order)

For transforming to CPS, the code is now allocating more funs while parsing. Initial test show little performance degradation, and any work done by the user with the parser events would make the difference in parsing costs irrelevant, but need more complete tests to assert that.

I only tested the utf8 backend, others should work the same.

What's your opinion?

willemdj commented 8 years ago

Hello Pablo,

Very interesting.

At first sight I would say that it might be even nicer if you could really 'pull' the next event, in stead of getting a list of all events. Do you think that would be possible?

I don't understand your point about hibernation, can you give an example?

Regards, Willem

On Mon, Jan 4, 2016 at 5:33 PM, Pablo Polvorin notifications@github.com wrote:

I wonder if you will be willing to merge something in the lines of this patch. (In such case this will need more tests and performance comparison)

The basic idea is:

  • Backward compatible, actual api is unchanged
  • Provide new, small api for allowing pull-parsing (erlsom:parse_pull/2).
  • Both use the same underling parsing code

For pull parsing, I mean:

3> {continue, C1, E1} = erlsom:parse_pull(<<"">>, [{output_encoding, utf8}]). {continue,#Fun, [startDocument, {startPrefixMapping,"s","http://some-namespace.com"}, {startElement,"http://some-namespace.com","xstream","s", [{attribute,"version",[],[],<<"1.0">>}, {attribute,"as",[],[],<<"234">>}]}]} 4> {continue, C2, E2} = C1(<<"ss">>). {continue,#Fun, [{startElement,[],"a",[], [{attribute,"href",[],[],<<"abc">>}]}, {startElement,[],"b",[],[]}, {characters,<<"ss">>}, {endElement,[],"b",[]}, {endElement,[],"a",[]}]} 6> {ok, E4, _Rest} = C3(<<"/s:xstream tail data">>). {ok,[{endElement,"http://some-namespace.com","xstream","s"}, {endPrefixMapping,"s"}, endDocument], " tail data"}

That is, allow the user of the parser to pass the data in chunks to the parser, instead of the parser asking for more in a callback.

This allow simpler integration for long running parsing (think in infinite xml streams coming from network), and also allows hibernating the parsing process, that on current version is not always possible (as hibernate discard the stack).

Returning the list of events instead of using sax callbacks is mostly a matter of preference, sax callbacks can still be used. I found returning a list easier to work with (and size of the list is bounded if you feed the parser in chunks).

The change was to make every possible "blocking" parsing operation be in continuation passing style. Then the continueFun/* in erlsom_sax_lib check if we are in a pull parsing or not. If we are a pull parser, instead of asking the user for more data, it just return a function from where caller can continue parsing:

continueFun2K(T, V1, V2, V3, State = #erlsom_sax_state{continuation_fun=undefined}, ParseFun, K) -> {continue, fun(Data) -> ParseFun(<<T/binary, Data/binary>>, V1, V2, V3, State#erlsom_sax_state{user_state = []}, K) end, lists:reverse(State#erlsom_sax_state.user_state)};

(when pull parsing, the sax callbacks just accumulate the events in the user_state, so here we return them on the right order)

For transforming to CPS, the code is now allocating more funs while parsing. Initial test show little performance degradation, and any work done by the user with the parser events would make the difference in parsing costs irrelevant, but need more complete tests to assert that.

I only tested the utf8 backend, others should work the same.

What's your opinion?

You can view, comment on, or merge this pull request online at:

https://github.com/willemdj/erlsom/pull/51 Commit Summary

  • Transform the parser to CPS. Support pull parsing

File Changes

Patch Links:

— Reply to this email directly or view it on GitHub https://github.com/willemdj/erlsom/pull/51.

ppolv commented 8 years ago

Hello, thanks for your time!. IMHO, for pulling events one-by-one, more intrusive changes are needed, as would imply that wrapCallback/2 could return the control to the caller (instead of only the continueFun* in erlsom_sax_lib). I, lazy as I'm, preferred to touch the less possible and reuse the same code for both parsing styles. The API for that could be easily provided based on this one, but of course that will still require all events of a "chunk" to be build in memory. So if you want to parse a big file using that "pull" interface, feeding the parser in chunks rather than the full file content would be a must.

Regarding hibernation, see the code at the end. The scenario in mind is where the parsing is over a long period of time, with data read from the network. Basically an endless stream (think in XMPP for example). If the caller is able to feed the parser itself, it has more control on when/how to do that. It is also much easier to embed in an gen_server/gen_fsm (erlsom:parse_sax with continuation_fun is not, as it is a blocking call that won't return).

-module(phibernation).
-compile(export_all).

send(Pid, []) ->
    Pid ! done;
send(Pid, [D|Rest]) ->
    Pid ! {data, D},
    receive after 100 -> ok end, %%just to simulate a little delay.  On cases this is minutes between each chunk comming from client side
    io:format("\nParser process memory usage: ~p\n", [erlang:process_info(Pid, memory)]),
    send(Pid, Rest).

test() ->
    io:format("\n*** Normal run\n"),
    test(fun() -> loop(false) end),
    io:format("\n*** Hibernate run\n"),
    test(fun() -> loop(true) end),
    io:format("\n*** Sax with continuation_fun\n"),
    test(fun() -> loop_with_callback(false) end),
    io:format("\n*** Sax with continuation_fun, hibernate (fail)\n"),
    test(fun() -> loop_with_callback(true) end),
    ok.

test(LoopFun) ->
    Data = [<<"<s:xstream xmlns:s='http://some-namespace.com' as='234' version='1.0'>">>, 
            <<"<a href='ab">>,
            <<"c'><b>ss</b></a>">>,
            <<"<a href='ab">>,
            <<"c'><b>ss</b></a>">>],
    Pid = spawn(LoopFun),
    send(Pid, Data).

loop(ShouldHibernate) ->
    loop(undefined, ShouldHibernate).
loop(Parser, ShouldHibernate) ->
    receive
        {data, Data} ->
            {continue, Parser2, Events} = case Parser of
                                               undefined -> erlsom:parse_pull(Data, [{output_encoding, utf8}]);
                                               _ -> Parser(Data)
                                          end,
            io:format("Parsed ~p  Events: ~p", [Data, Events]),
            case ShouldHibernate of
                %% Hibernate the process to reduce memory usage, since we expect that further data won't come in for a while
                true ->  erlang:hibernate(?MODULE, loop, [Parser2, ShouldHibernate]);
                false -> loop(Parser2, ShouldHibernate)
            end;
        done ->
            ok
    end.

loop_with_callback(ShouldHibernate) ->
    SaxCallbackState = [],
    F = fun(Ev, _) -> io:format("~p" ,[Ev]) end,
    Continuation = case ShouldHibernate of
                       true -> fun continuation_hibernated/2;
                       false -> fun continuation/2
                   end,
    erlsom:parse_sax(<<>>, SaxCallbackState, F, [{continuation_function, Continuation, []}, {output_encoding, utf8}]).

continuation(Tail, State) ->
    receive
        {data, Data} ->
            {<<Tail/binary, Data/binary>>, State}
    end.

continuation_hibernated(Tail, State) ->
    %% This won't work.  erlang:hibernate discard the call stack, (no erlsom:parse_sax/4 is left) so when this process wakeups
    %% continuation/2 will be called and the process ends.
    erlang:hibernate(?MODULE, continuation, [Tail, State]).

Results (events prints removed for clarity) :

*\ Normal run Parser process memory usage: {memory,8688} Parser process memory usage: {memory,8688} Parser process memory usage: {memory,13576} Parser process memory usage: {memory,13576} Parser process memory usage: {memory,16592}

*\ Hibernate run Parser process memory usage: {memory,2520} Parser process memory usage: {memory,2288} Parser process memory usage: {memory,2560} Parser process memory usage: {memory,2288} Parser process memory usage: {memory,2560}

*\ Sax with continuation_fun Parser process memory usage: {memory,10560} Parser process memory usage: {memory,10560} Parser process memory usage: {memory,13576} Parser process memory usage: {memory,13576} Parser process memory usage: {memory,8688}

*\ Sax with continuation_fun, hibernate (fail) Parser process memory usage: undefined Parser process memory usage: undefined Parser process memory usage: undefined Parser process memory usage: undefined Parser process memory usage: undefined

One instead of hibernate could force GC, but isn't the same

willemdj commented 8 years ago

Hi Pablo,

Just a short note to tell you that I did not forget about your idea.

I agree that support for hibernation would be nice, and the idea of a continuation is also nice. But I still don't like the idea of getting a list of events, for me that goes against the idea of erlsom, which (in my mind) is to look at sax events one by one.

So I would like to think about this a bit more when I have some more time.

(To get a bit more feeling for the idea I wrote the small module below: this is a simple way to use the existing parser as the basis for a pull parser)

-module(pull_parser).

-export([start/1, pull/1]).

%% export for use by this module only
-export([parse/1]).

%% starts a new process that parses up to the next event and then
%% waits until it receives a pull-request
start(Xml) ->
  spawn(?MODULE, parse, [Xml]).

pull(Parser) ->
  Parser ! {pull, self()},
  receive
    Event ->
      Event
  end.

%% internal functions

parse(Xml) ->
  erlsom:parse_sax(Xml, [], fun callback/2).

callback(endDocument, State) ->
  State;
callback(Event, State) ->
  receive
    {pull, Client} ->
       Client ! Event,
       State
  end.

Regards, Willem