zepheira / amara

Amara 2.0. The Amara XML toolkit is an open-source collection of Python tools for XML processing: not just tools that happen to be written in Python, but tools built from the ground up to use Python idioms and to exploit Python's many advantages over other programming languages.
http://wiki.xml3k.org/Amara2
Apache License 2.0

Amara 2.0.0_alpha6 XML bindery leaks memory #19

Open kcgen opened 10 years ago

kcgen commented 10 years ago

bindery.parse(...) leaks memory amounting to roughly 200 times the size of the XML document it parses. A 1KB document leaks about 200KB of RAM, which quickly adds up for long-running processes that parse repeatedly.

In addition, each parse(...) call leaves uncollectable objects behind in Python's garbage-collector tracking, and each call adds a constant (O(1)) increment of CPU overhead to later processing, perhaps because of the ever-growing reference-tracking load.
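To narrow down which objects are accumulating, one option is to compare a per-type histogram of GC-tracked objects before and after a batch of calls. The following is a minimal sketch using only the standard library; the helper names `type_counts` and `type_growth` are hypothetical and not part of Amara.

```python
import gc
from collections import Counter


def type_counts():
    """Count GC-tracked objects by type name, after a full collection."""
    gc.collect()
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def type_growth(func, repeats=10):
    """Call func() repeatedly and return the types whose counts grew.

    The 'before' snapshot is itself a tracked object, so a single extra
    Counter may show up in the result; treat that entry as noise.
    """
    before = type_counts()
    for _ in range(repeats):
        func()  # result is discarded so it cannot keep anything alive
    after = type_counts()
    return {name: after[name] - before[name]
            for name in after if after[name] > before[name]}


# Assumed usage against the parser from this report:
# from amara import bindery
# print(type_growth(lambda: bindery.parse(open('test.xml'))))
```

Types whose growth scales with `repeats` are the likely leak candidates. Note that gc.get_objects() only sees container objects tracked by the collector, so memory held by untracked objects or C-level allocations will not appear here.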

How to reproduce:

Python 2.7.6 (default, May  7 2014, 07:34:22) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gc 
>>> len(gc.get_objects())
3803
>>> from amara import bindery
>>> len(gc.get_objects())
12113
>>> bindery.parse(open('test.xml'))
<entity at 0xb2ee60: 1 children>
>>> len(gc.get_objects())
12285
>>> bindery.parse(open('test.xml'))
<entity at 0xbed710: 1 children>
>>> len(gc.get_objects())
12832
>>> bindery.parse(open('test.xml'))
<entity at 0xbf6f80: 1 children>
>>> len(gc.get_objects())
12764
>>> bindery.parse(open('test.xml'))
<entity at 0xbf6050: 1 children>
>>> len(gc.get_objects())
13297
>>> bindery.parse(open('test.xml'))
<entity at 0xbf3830: 1 children>
>>> len(gc.get_objects())
13229
>>> bindery.parse(open('test.xml'))
<entity at 0xbf37a0: 1 children>
>>> len(gc.get_objects())
13762
>>> bindery.parse(open('test.xml'))
<entity at 0xcd60e0: 1 children>
>>> len(gc.get_objects())
13707
>>> bindery.parse(open('test.xml'))
<entity at 0xbf6050: 1 children>
>>> len(gc.get_objects())
14240

Compare this to a well-behaved parser:

Python 2.7.6 (default, May  7 2014, 07:34:22) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gc
>>> import xml.etree.cElementTree as ET
>>> len(gc.get_objects())
4289
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a0197b0>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019e10>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019990>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019ed0>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019870>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019a20>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019db0>
3849
>>> ET.parse(open('test.xml')).getroot(); len(gc.get_objects())
<Element '{http://www.w3.org/2005/Atom}feed' at 0x7fa03a019ae0>
3849
>>>