Amara 2.0. The Amara XML toolkit is an open-source collection of Python tools for XML processing: not just tools that happen to be written in Python, but tools built from the ground up around Python idioms, taking advantage of Python's strengths relative to other languages.
bindery.parse(...) leaks roughly 200 times the size of the XML document it parses: a 1KB document leaks about 200KB of RAM, which quickly adds up for long-running processes that parse repeatedly.
In addition, each parse(...) call leaves uncollectable objects in Python's garbage-collector tracking, and each call appears to add a constant (O(1)) increment of CPU overhead to the calls that follow, perhaps because of the ever-growing reference-tracking load.
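The growth can be measured directly. The following is a minimal sketch (not part of the original transcript; it assumes test.xml is the small Atom feed used below) that samples the collector's tracked-object count and peak resident set size around repeated parses:

import gc
import resource
from amara import bindery

def peak_rss_kb():
    # On Linux, ru_maxrss is reported in kilobytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

for i in range(20):
    bindery.parse(open('test.xml'))
    gc.collect()  # give the collector every chance to reclaim the tree
    print('%d tracked=%d peak_rss=%dKB' % (i, len(gc.get_objects()), peak_rss_kb()))

If the parse trees were collectable, the tracked-object count would level off after the first few iterations; instead, as the transcript below shows, it climbs by a few hundred objects per call even after an explicit collection.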
How to reproduce:
Python 2.7.6 (default, May 7 2014, 07:34:22)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gc
>>> len(gc.get_objects())
3803
>>> from amara import bindery
>>> len(gc.get_objects())
12113
>>> bindery.parse(open('test.xml'))
<entity at 0xb2ee60: 1 children>
>>> len(gc.get_objects())
12285
>>> bindery.parse(open('test.xml'))
<entity at 0xbed710: 1 children>
>>> len(gc.get_objects())
12832
>>> bindery.parse(open('test.xml'))
<entity at 0xbf6f80: 1 children>
>>> len(gc.get_objects())
12764
>>> bindery.parse(open('test.xml'))
<entity at 0xbf6050: 1 children>
>>> len(gc.get_objects())
13297
>>> bindery.parse(open('test.xml'))
<entity at 0xbf3830: 1 children>
>>> len(gc.get_objects())
13229
>>> bindery.parse(open('test.xml'))
<entity at 0xbf37a0: 1 children>
>>> len(gc.get_objects())
13762
>>> bindery.parse(open('test.xml'))
<entity at 0xcd60e0: 1 children>
>>> len(gc.get_objects())
13707
>>> bindery.parse(open('test.xml'))
<entity at 0xbf6050: 1 children>
>>> len(gc.get_objects())
14240
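The CPU claim can be tested the same way. Here is a hypothetical timing harness (again a sketch, not from the original report) that times successive batches of parses; if each call leaves behind tracked garbage, later batches should run measurably slower:

import time
from amara import bindery

def time_batch(n=100):
    # Parse the same document n times and return the wall-clock cost.
    start = time.time()
    for _ in range(n):
        bindery.parse(open('test.xml'))
    return time.time() - start

for batch in range(5):
    print('batch %d: %.3fs' % (batch, time_batch()))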
Compare this to a well-behaved parser:
Python 2.7.6 (default, May 7 2014, 07:34:22)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gc
>>> import xml.etree.cElementTree as ET
>>> len(gc.get_objects())
4289
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a0197b0>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019e10>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019990>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019ed0>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019870>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019a20>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019db0>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019ae0>
3849
>>>
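Until the leak is fixed, one possible mitigation for long-running processes (a sketch only; doc.feed.title assumes test.xml is the Atom feed from the transcripts above) is to confine each parse to a short-lived worker process, so whatever parse(...) leaks is reclaimed by the operating system when the worker exits:

import multiprocessing
from amara import bindery

def extract_title(path):
    # Runs in a child process; only a plain string crosses back to the
    # parent, and the leaked parse state dies with the worker.
    doc = bindery.parse(open(path))
    return unicode(doc.feed.title)

if __name__ == '__main__':
    # Recycle each worker after 50 tasks to cap the leaked memory per process.
    pool = multiprocessing.Pool(processes=1, maxtasksperchild=50)
    print(pool.apply(extract_title, ('test.xml',)))
    pool.close()
    pool.join()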