widesky / hszinc

Project Haystack Zinc (Zinc is not CSV) parser and dumper for Python 2.7 and 3.x
https://widesky.cloud
BSD 2-Clause "Simplified" License
10 stars 8 forks source link

Parsing zinc file is too slow #11

Open prajgujarathi opened 6 years ago

prajgujarathi commented 6 years ago

hszinc.parse(file_) take too much time to parse a simple zinc file. Is there a way to improve the performance of that function.

sjlongland commented 6 years ago

Can we define "too much time" and "simple zinc file"? Are we talking milliseconds, seconds, hours, days? Is the file 100 bytes, 100kB, 100MB? Also, which version of hszinc are you using? What Python environment? Is this running on an under-clocked 386SX/16MHz or an over-clocked Core i7/Rysen 7?

I'll admit I've done no real performance analysis on hszinc when dealing with ZINC grids, focussing on getting it working at all.

There was a re-write recently that replaced parsimonious with pyparsing as the underlying parsing workhorse: this was done because the grammar needed for the former was getting very complicated for maintenance purposes, and adding list support was made nearly impossible.

I tried numerous variations, but couldn't reliably get lists to parse correctly using parsimonious due to the recursive nature of the grammar required. Hence, I moved it over to pyparsing. I don't know if this made performance "better" or "worse": this was not a consideration at the time.

I'm open to reviewing patches that address the performance issues on the proviso they don't break existing functionality.

If performance is a consideration, I'd seriously consider using JSON for grid representation as that uses the standard json module in Python, and thus should have fewer performance issues.

prajgujarathi commented 6 years ago

Hi Stuart, thanks a lot for replying. About the size of a file for 3000kb zinc file, it takes about 10mints on core i7 processor. As you suggested using JSON format is quite faster than Zinc parsing. So, for now, I am going to use JSON for grids.

sjlongland commented 6 years ago

Ahh okay, yeah, 3MB is huge. My testing has been with grids that are more like 3kB. :-)

ZINC is challenging as while it is CSV-based, it is different enough from CSV that using a standard CSV parser will not yield reliable parsing results. Ultimately I'd like to speed things up with a C extension, but that can be a problem on some platforms without a C compiler and libraries. I figure "get it working right", then "get it working fast".

Soon as I get issue #8 finished, I'll have a "complete" ZINC parser with unit tests: and I can start re-implementing parts for speed as I'll have the unit tests to tell me whether I've accidentally broken something or not.

ChristianTremblay commented 5 years ago

Making some test comparing JSON to ZINC on Niagara4 and can tell it is slow. Too slow infact. For now on, I'll switch Niagara stuff to JSON by default.