xBimTeam / XbimEssentials

A .NET library to work with data in the IFC format. This is the core component of the Xbim Toolkit
https://xbimteam.github.io/

recreate yacc and lex source code #586

Open santiagoIT opened 3 weeks ago

santiagoIT commented 3 weeks ago

Hello,

I have enhanced the string regular expression used in StepP21Lex.lex to fix a problem that we have encountered a few times with certain IFC files.

I then ran the MAKEPARSER.BAT batch file to recreate the yacc and lex source files, but I got compile errors. So I undid my changes and ran MAKEPARSER.BAT again on the unmodified sources. It seems to have run fine:

[screenshot]

But the generated StepP21Lex.cs file has some changes in it that lead to compile errors:

Longs have been turned into ints: [two screenshots]

Also an ifdef is lost: [screenshot]

Should all that be fixed manually or am I missing something or doing something wrong?

andyward commented 3 weeks ago

Yes, it's a hack from a while back. See https://github.com/xBimTeam/XbimEssentials/issues/561#issuecomment-2160556569

We should really look to replace this old Gardens Point parser

andyward commented 3 weeks ago

I meant to add: you should be able to run git cherry-pick -n 6517bc1 to re-apply the fix from commit 6517bc16042b3cfd820dd7eb45f72bbab92d13ad to your local branch

santiagoIT commented 3 weeks ago

@andyward It is precisely the single backslash issue that I am trying to address. I hope to be able to try this out soon, and hopefully all unit tests will pass. If so, I will submit a pull request. We run into this problem frequently.

I hope there are tests covering the short Unicode encoding; if not, I will try to add some. I need to make sure that my regex does not break anything there.

santiagoIT commented 2 weeks ago

Unfortunately, the change I made to the regex broke some tests. I wanted the parser to be tolerant of incorrectly encoded strings. I ran into the EncodeBackslash() test, which is now disabled, and I can see that the parser used to work that way (fault tolerant) but that this had to be changed.

I believe the correct approach would be to detect invalid strings by adding a new token type (Tokens.STRING_INVALID) in the lex file. An exception could then be thrown specifying the line number and the offending string, which would make it clear to the user why the file does not load. I know very little about the encoding of strings in IFC.
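To make the idea concrete, here is a rough sketch in Python rather than in the lex grammar. The names InvalidStringError and check_strings are made up for illustration and are not part of xbim, and the validity rule is passed in as a function because the exact rule is still an open question. It also assumes that a string never spans more than one line.

```python
import re


class InvalidStringError(ValueError):
    """Hypothetical error type carrying the line number and offending string."""

    def __init__(self, line_no, text):
        super().__init__(f"line {line_no}: invalid string {text}")
        self.line_no = line_no
        self.text = text


# A quoted run: opening apostrophe, then any non-apostrophe characters or
# doubled apostrophes, then the closing apostrophe.
_QUOTED = re.compile(r"'(?:[^']|'')*'")


def check_strings(lines, is_valid):
    """Raise InvalidStringError for the first quoted string failing is_valid.

    Assumption: strings do not span lines, so each line is scanned on its own.
    """
    for line_no, line in enumerate(lines, start=1):
        for match in _QUOTED.finditer(line):
            if not is_valid(match.group()):
                raise InvalidStringError(line_no, match.group())
```

A caller would catch InvalidStringError and show its message, so the user sees the line and the string that stopped the load instead of a generic parse failure.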

Are these the only valid encodings for IFC? https://technical.buildingsmart.org/resources/ifcimplementationguidance/string-encoding/

1) \S\ . No idea where this comes from.
2) \PA\ . Are other code pages supported?

Basically, what I am trying to come up with is a regex that can be used to detect invalid strings. This regex would be run before the regular string regex.
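As a starting point, this is the kind of regex I have in mind, again written in Python only to try it out quickly. The list of escapes is my reading of the ISO 10303-21 string rules (doubled apostrophe, doubled backslash, \S\, \PA\ to \PI\, \X\, \X2\ and \X4\ closed by \X0\) and should be checked against the buildingSMART page above. VALID_STRING and is_valid_step_string are hypothetical names.

```python
import re

# Alternatives for one valid unit between the quotes of a STEP string.
# Assumption: this list reflects ISO 10303-21; it has not been checked
# against the xbim lexer.
_UNITS = [
    r"[^'\\]",                        # plain character (not ' and not a backslash)
    r"''",                            # doubled apostrophe
    r"\\\\",                          # doubled backslash
    r"\\S\\[\x20-\x7E]",              # S directive followed by one character
    r"\\P[A-I]\\",                    # code page selection, A to I
    r"\\X\\[0-9A-F]{2}",              # X directive with two hex digits
    r"\\X2\\(?:[0-9A-F]{4})+\\X0\\",  # X2 run of 4-digit groups, closed by X0
    r"\\X4\\(?:[0-9A-F]{8})+\\X0\\",  # X4 run of 8-digit groups, closed by X0
]

VALID_STRING = re.compile("'(?:" + "|".join(_UNITS) + ")*'")


def is_valid_step_string(token):
    """Return True if the whole quoted token uses only the escapes above."""
    return VALID_STRING.fullmatch(token) is not None
```

With this rule a path written with a single backslash, such as 'C:\temp', is reported as invalid, while the same path with doubled backslashes passes. In the lexer the check would have to be inverted, so that anything quoted that fails this pattern becomes the invalid-string token.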

Thank you!