zadean / yaccety_sax

Fast, StAX-like XML Parser for BEAM Languages
Apache License 2.0

How to replace one value with another? #1

Open tomekowal opened 5 years ago

tomekowal commented 5 years ago

Hey! There is no documentation yet, and we would like to try the library. Our use case is modifying elements based on their contents. For example, reversing the contents of each tag/subtag, so that this input:

<tag>
  <subtag>asdf</subtag>
  <subtag>qwer</subtag>
  <subtag>asdf</subtag>
</tag>

becomes:

<tag>
  <subtag>fdsa</subtag>
  <subtag>rewq</subtag>
  <subtag>fdsa</subtag>
</tag>

zadean commented 5 years ago

@tomekowal, Thanks for the issue and interest. I've been a bit busy, but I hope to get to this next week.

I'll make the project "rebar-able" and add some documentation/examples for handling individual nodes and transforming document streams.

zadean commented 5 years ago

@tomekowal, before I write a bunch of docs that will need to be changed, I'd like to see if this makes sense to you:

It's a more "procedural" example for ease of following the flow. Mind you, the API isn't near final, but the parser will always be similar to an iterator, and there will be a writer as well as a reader. So that won't change. Maybe just the names. :-)

run() ->
    Input = <<"<tag>\n  <subtag>asdf</subtag>\n  <subtag>qwer</subtag>\n  "
              "<subtag>asdf</subtag>\n</tag>">>,
    State = stax:stream(Input, [{whitespace, false}]),
    % fake it for now until there is a serialization API
    OutState = {<<>>, #{}},

    % read and assert the startDocument event, write it out
    {#{type := startDocument} = E1, State1} = stax:next_event(State),
    OutState1 = stax:write_event(E1, OutState),

    % read and assert the startElement event for the "tag" tag, write it out
    {#{type  := startElement,
       qname := {<<>>, <<>>, <<"tag">>}} = E2, State2} = stax:next_event(State1),
    OutState2 = stax:write_event(E2, OutState1),

    {State3, OutState3} = reverse_subtag(State2, OutState2),
    {State4, OutState4} = reverse_subtag(State3, OutState3),
    {State5, OutState5} = reverse_subtag(State4, OutState4),

    % read and assert the endElement event for the "tag" tag, write it out
    {#{type  := endElement,
       qname := {<<>>, <<>>, <<"tag">>}} = E3, State6} = stax:next_event(State5),
    OutState6 = stax:write_event(E3, OutState5),

    % read and assert the endDocument event, write it out
    {#{type := endDocument} = E4, _State7} = stax:next_event(State6),
    {Output, _} = stax:write_event(E4, OutState6),

    Output.

reverse_subtag(State, OutState) ->
    case stax:next_event(State) of
        % the 'subtag' opening tag
        {#{type := startElement} = E1, State1} ->
            OutState1 = stax:write_event(E1, OutState),
            reverse_subtag(State1, OutState1);
        % the text to change
        {#{type := characters,
           data := Sub} = E1, State1} ->
            OutState1 = stax:write_event(E1#{data := do_flip(Sub)}, OutState),
            reverse_subtag(State1, OutState1);
        % the 'subtag' closing tag, so return
        {#{type := endElement} = E1, State1} ->
            OutState1 = stax:write_event(E1, OutState),
            {State1, OutState1}
    end.

do_flip(Text) ->
    Chs = [T || <<T/utf8>> <= Text],
    Rev = lists:reverse(Chs),
    << <<C/utf8>> || C <- Rev >>.
tomekowal commented 5 years ago

Seems clear. I just realised that there is no Enum.reduce in Erlang, only foldl and foldr on lists, so the recursive bits need to be written by hand. Also, I think you can use string:reverse because it correctly groups things into grapheme clusters, but it still returns io data rather than a binary (but that is outside of the discussion :))
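
A minimal sketch of how the string:reverse suggestion could be applied to do_flip. The module name flip is made up for illustration. string:reverse/1 (OTP 20 and later) returns a list of grapheme clusters rather than a binary, so this sketch converts the result back with unicode:characters_to_binary/1; it assumes the input is valid UTF-8.

```erlang
-module(flip).
-export([do_flip/1]).

%% Reverse a UTF-8 binary by grapheme cluster rather than by code point.
%% string:reverse/1 returns a list of grapheme clusters, not a binary,
%% so the result is converted back with unicode:characters_to_binary/1.
%% Assumption: Text is valid UTF-8 (otherwise the conversion returns an
%% error tuple instead of a binary).
do_flip(Text) when is_binary(Text) ->
    unicode:characters_to_binary(string:reverse(Text)).
```

Compared with the code-point comprehension in the original do_flip, this keeps a base character and its combining marks together when the text is reversed.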

zadean commented 5 years ago

Yeah... string:reverse, doh! :-) Since I have no experience with Elixir, it would be interesting to see what the same example would look like in it. Also, is the return type of the stax:next_event call, {Event, State}, easy enough to work with, or should it be changed to something else?

tomekowal commented 5 years ago

Hey, I made an example Elixir application that uses yaccety_sax: https://github.com/tomekowal/yaccety_sax_test/blob/master/test/yaccety_sax_test_test.exs All the exciting stuff is in the test file. The first test is what you pasted above, rewritten in Elixir. The second one is an example of using Elixir streams and Enum.reduce to work with it. The third one is the reversing example again, but using streams.

As you can see, the {Event, State} return value is perfect because stream generators expect exactly that format: {CurrentElement, StateToBuildNextElement}.
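
On the Erlang side, where the recursion has to be written by hand, the same {Event, State} shape can be wrapped once in a small fold helper. This is only a sketch: the module event_fold, the function list_next/1, and the done marker are all invented for illustration and are not part of the stax API. A plain list stands in for the parser state.

```erlang
-module(event_fold).
-export([fold/4, list_next/1]).

%% Fold Fun over every event produced by Next, starting from State.
%% Next(State) must return {Event, NewState}, or the atom done when
%% there is nothing left. The done marker is an assumption of this
%% sketch; a real parser might instead stop on its endDocument event.
fold(Next, State, Fun, Acc) ->
    case Next(State) of
        done ->
            Acc;
        {Event, State1} ->
            fold(Next, State1, Fun, Fun(Event, Acc))
    end.

%% Stand-in iterator over a plain list, used here in place of a parser.
list_next([]) -> done;
list_next([Event | Rest]) -> {Event, Rest}.
```

For example, event_fold:fold(fun event_fold:list_next/1, [a, b, c], fun(E, Acc) -> [E | Acc] end, []) collects the events in reverse order. The helper only relies on the iterator returning a two-tuple of the current element and the state needed to produce the next one.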

zadean commented 4 years ago

Cool! And great that the output format fits so well!

Time permitting, I'll try to finish the rest of the implementation (DTD, default attributes stuff, external references and entities, etc.).

Also documentation. :-)

How was the performance? Low memory footprint? Fast enough?

tomekowal commented 4 years ago

Unfortunately, I didn't test it on anything more significant than that toy example. We don't have that many big XML files, anyway. For now, we settled on using :xmerl in our project. We will watch closely how this repo evolves :)