weso / CWR-DataApi

MIT License

Change the parser library #176

Open Bernardo-MG opened 7 years ago

Bernardo-MG commented 7 years ago

The main problem for the project is the parsing library, which works well for small projects but is not suited to parsing huge files.

This article contains a list of Python parsers: https://tomassetti.me/parsing-in-python/#parserGenerators

The library should support a BNF grammar, which should be easy to create from the CWR specification.

Note that the list includes ANTLR, which does support these grammars.

I also have experience with Ply, but it does not seem like a good option for complex grammars.

These projects can be useful as references, as they are my own tests with parsers: https://github.com/Bernardo-MG/dice-notation-java https://github.com/Bernardo-MG/dice-notation-python

Based on all this, I think the best course of action would be:

This would require reworking the project, and probably dropping much of the current code.

Bernardo-MG commented 7 years ago

Preparing ANTLR grammar:

https://github.com/Bernardo-MG/cwr-grammar

This can be used to generate a Python parser. Validation rules should be applied to this parser.

Bernardo-MG commented 6 years ago

Checked the current version of the ANTLR grammar against the test files. It parses all of them except the 230MB file. After increasing memory, that file is parsed in close to 3 minutes, but some errors are found.

Currently the ANTLR grammar splits the file into transactions and records, but each record is left as the full line, unprocessed.
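To illustrate what that split looks like, here is a rough Python sketch. It is not the ANTLR grammar from cwr-grammar, just a hand-written approximation. The function name `split_transactions` is made up for this example, and the field offsets (record type in the first 3 characters, transaction sequence number in the next 8) and the set of control record types are assumptions based on my reading of the CWR record prefix, so check them against the specification. The sample lines in the usage note are invented as well.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

# Assumed layout of the record prefix (verify against the CWR specification):
# characters 0-2 hold the record type, characters 3-10 the transaction sequence number.
RECORD_TYPE = slice(0, 3)
TRANSACTION_SEQ = slice(3, 11)

# Assumed control records that sit outside any transaction.
CONTROL_TYPES = {"HDR", "GRH", "GRT", "TRL"}


def split_transactions(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Lazily yield (key, raw_records) pairs.

    The key is the record type for control records, and the transaction
    sequence number for records that belong to a transaction. Records are
    kept as full, unprocessed lines.
    """
    # Strip line endings and skip blank lines without loading the whole file.
    records = (line.rstrip("\r\n") for line in lines if line.strip())

    def key(record: str) -> str:
        record_type = record[RECORD_TYPE]
        if record_type in CONTROL_TYPES:
            return record_type
        return record[TRANSACTION_SEQ]

    # groupby only merges consecutive records sharing a key, so each
    # transaction is emitted as soon as the next one starts.
    for group_key, group in groupby(records, key=key):
        yield group_key, list(group)
```

Usage, with invented sample lines:

```python
sample = [
    "HDRPB000000199SAMPLE PUBLISHER",
    "GRHNWR0000102.10",
    "NWR0000000000000000SONG ONE",
    "SPU0000000000000001PUBLISHER A",
    "NWR0000000100000000SONG TWO",
    "GRT00001",
    "TRL00001",
]

for group_key, raw_records in split_transactions(sample):
    print(group_key, len(raw_records))
```

Passing an open file object instead of a list reads it line by line, so only one transaction is held in memory at a time. That might help with the 230MB file, although I have not measured it.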

The next step is generating the Python parser and adding it to the project.