willemdj / erlsom

XML parser for Erlang
GNU Lesser General Public License v3.0

Fix for output_encoding, utf8 #12

Closed: dLuna closed 12 years ago

dLuna commented 12 years ago

With output_encoding set to utf8, the old code would crash on a CDATA section that was not flush with its surrounding tags.

There are many more uses of ++ in this module, and I don't know the code well enough to tell reliably whether some of them should also be replaced with a version that works on both binaries and lists.
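To illustrate what "a version that works on both binaries and lists" could look like, here is a minimal sketch. The module name append_sketch and the function append/2 are made up for this example and are not taken from erlsom or from the actual patch; the sketch only assumes that both arguments are of the same kind (two lists or two binaries).

```erlang
-module(append_sketch).
-export([append/2]).

%% Hypothetical helper, not part of erlsom: concatenate two pieces of
%% character data that are either both lists (strings) or both binaries.
%% The ++ operator only accepts a list on its left-hand side, so calling
%% it with a binary there raises badarg.
append(A, B) when is_list(A), is_list(B) ->
    %% Plain strings: ordinary list concatenation.
    A ++ B;
append(A, B) when is_binary(A), is_binary(B) ->
    %% Binaries (what the parser emits for utf8 output): use binary
    %% construction syntax instead of ++.
    <<A/binary, B/binary>>.
```

With this, append_sketch:append(<<"\n">>, <<"Testing">>) yields a single binary, and append_sketch:append("\n", "Testing") behaves like ++. Mixed arguments are deliberately left unhandled and fail with function_clause, since guessing an encoding for the mixed case seems riskier than crashing early.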

Feedback and comments very welcome.

willemdj commented 12 years ago

Hi,

I don't have much time right now, but I'll look into it.

Can you explain what you mean by "CDATA which was not flush with its surrounding tags"? Or provide an example?

Regards, Willem

dLuna commented 12 years ago

The following XSD and XML files work if you put the <![CDATA[ flush against <whatnot>, but not the way they are written in the example below.

Save these examples as example2.xsd and example2.xml and run

erlsom:scan(element(2, file:read_file("example2.xml")), element(2, erlsom:compile_xsd_file("example2.xsd")), [{output_encoding, utf8}]).

and you will get the following crash. Remove [{output_encoding, utf8}] and it works. It is entirely possible that the bug is in erlsom_sax_utf8.erl instead; a comment on line 862 makes me suspect that is the case. I don't understand the code base well enough to solve it that way.

** exception throw: {'EXIT',
                     {error,
                      [{exception,
                        {badarg,
                         [{erlang,'++',[<<"\n">>,<<"Testing">>],[]},
                          {lists,append,2,[{file,"lists.erl"},{line,63}]},
                          {erlsom_parse,stateMachine,2,
                           [{file,"src/erlsom_parse.erl"},{line,652}]},
                          {erlsom_parse,xml2StructCallback,2,
                           [{file,"src/erlsom_parse.erl"},{line,299}]},
                          {erlsom_sax_utf8,wrapCallback,2,
                           [{file,"src/erlsom_sax_utf8.erl"},{line,1364}]},
                          {erlsom_sax_utf8,parseContentLT,2,
                           [{file,"src/erlsom_sax_utf8.erl"},{line,864}]},
                          {erlsom_sax_utf8,parse,2,
                           [{file,"src/erlsom_sax_utf8.erl"},{line,196}]},
                          {erlsom,scan2,3,
                           [{file,"src/erlsom.erl"},{line,211}]}]}},
                       {stack,[{'#PCDATA',char,<<"\n">>},'top-type']},
                       {received,{characters,<<"Testing">>}}]}}
     in function  erlsom:scan2/3 (src/erlsom.erl, line 215)

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="whatnot-type">
    <xs:restriction base="xs:string" />
  </xs:simpleType>
  <xs:complexType name="top-type">
    <xs:all>
      <xs:element name="whatnot" type="whatnot-type"></xs:element>
    </xs:all>
  </xs:complexType>
  <xs:element name="top" type="top-type" />
</xs:schema>

<top><whatnot>
<![CDATA[Testing]]></whatnot></top>

willemdj commented 12 years ago

Thanks, I merged it to master.