Open Bernardo-MG opened 7 years ago
Preparing ANTLR grammar:
https://github.com/Bernardo-MG/cwr-grammar
This can be used to generate a Python parser. Validation rules should be applied to this parser.
Checked the current version of the ANTLR grammar against the test files. It parses all of them except the 230 MB file. After increasing the available memory, that file is parsed in close to 3 minutes, but some errors are found.
Currently the ANTLR grammar splits the file into transactions and records, but each record is left as the full, unprocessed line.
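To illustrate the kind of split described above, here is a minimal Python sketch. It is not the ANTLR grammar or any code from the project. It assumes the record type is the first three characters of each line, and the `TRANSACTION_HEADERS` and `CONTROL_RECORDS` sets and the `split_transactions` name are hypothetical placeholders; the real lists would come from the CWR specification. Each record is kept as the raw line, just as the grammar currently does.

```python
from typing import Iterable, Iterator, List

# Hypothetical subset of record types that open a transaction.
# The complete list would have to be taken from the CWR specification.
TRANSACTION_HEADERS = {"NWR", "REV", "AGR"}

# Control records assumed to sit outside any transaction.
CONTROL_RECORDS = {"HDR", "GRH", "GRT", "TRL"}


def split_transactions(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group raw lines into transactions, leaving each record unprocessed.

    Works as a generator, so only one transaction is held in memory
    at a time when the input is read lazily.
    """
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        record_type = line[:3]
        if record_type in TRANSACTION_HEADERS or record_type in CONTROL_RECORDS:
            # A header or control record closes the transaction in progress.
            if current:
                yield current
            # Only a transaction header starts a new group.
            current = [line] if record_type in TRANSACTION_HEADERS else []
        elif current:
            # Detail record belonging to the open transaction.
            current.append(line)
    if current:
        yield current


if __name__ == "__main__":
    sample = ["HDR", "GRH", "NWR 1", "SPU 1", "NWR 2", "GRT", "TRL"]
    for transaction in split_transactions(sample):
        print(transaction)
```

Passing an open file object instead of a list lets the function read line by line, which may help with very large files since the whole input is never loaded at once.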
The next step is generating the Python parser and adding it to the project.
The main problem for the project is the parsing library, which works well for small projects but not for parsing huge files.
This article contains a list of Python parsers: https://tomassetti.me/parsing-in-python/#parserGenerators
The library should support a BNF grammar, which should be easy to create from the CWR specification.
Note that the list includes ANTLR, which does support these grammars.
I also have experience with Ply, but it does not seem like a good option for complex grammars.
These projects can be useful as references, as they are my own tests with parsers: https://github.com/Bernardo-MG/dice-notation-java https://github.com/Bernardo-MG/dice-notation-python
Based on all this, I think the best course of action would be:
This would require reworking the project, and probably dropping much of the current code.