wwpdb-dictionaries / mmcif_pdbx

wwPDB PDBx/mmCIF Dictionary
Creative Commons Zero v1.0 Universal
9 stars 9 forks source link

regular expression problems #13

Closed mhekkel closed 4 years ago

mhekkel commented 4 years ago

Two of the regular expressions in the current version of the library cause trouble. One is clearly in error the other is living in 'undefined behaviour' space.

The error is in deposition_email which contains the construct: [a-zA-Z0-9-.] I assume the author intended this to have the ranges a-z, A-Z, 0-9 and then the characters - and . However, the standard explicitly states that if you want to include a single hyphen, it should be the first character after the opening bracket.

The second error is in 3x4_matrices. Here we see the construct {3}? Some regular expression libraries choke on this and it could easily be replaced with {0,3} which is a very nice replacement.

Hope this can be fixed.

epeisach commented 4 years ago

Thank you for the feedback. I agree with the ambiguity in the hyphen case.

With regards to the {3} -- what is the proper syntax if you would like exactly three copies. Would {3,3} be allowed?

epeisach commented 4 years ago

Actually - I am seeing another specification that states that hyphen should be escaped... However, POSIX standard indicates "To match a -, put it right before the closing ]" - so you are saying after [ - and this one says before ].

I found another that states: "The - character is treated as a literal character if it is the last or the first character within the brackets: [abc-], [-abc]."

Which is correct?

Extended Regular Expression Syntax states that {m} matches exactly {m} times. Basic regular Expression (BRE) indicates {m}

For reference - I am using https://pubs.opengroup.org/onlinepubs/009696899/basedefs/xbd_chap09.html

mhekkel commented 4 years ago

Op 29-06-2020 om 23:06 schreef Ezra Peisach:

Actually - I am seeing another specification that states that hyphen should be escaped... However, POSIX standard indicates "To match a -, put it right before the closing ]" - so you are saying after [ - and this one says before ].

I found another that states: "The - character is treated as a literal character if it is the last or the first character within the brackets: [abc-], [-abc]."

Which is correct?

Sorry, I mixed up two cases. A closing bracket should be the first if it is to be included. A hyphen should be first or last. So if both a bracket and a hyphen are to be included the hyphen should obviously be last.

Extended Regular Expression Syntax states that {m} matches exactly {m} times. Basic regular Expression (BRE) indicates {m}

The construct {0,3} would allow either zero or three occurrences. Which is a valid POSIX regular expression as far as I know. You could test it with your tools to be sure it is accepted. The construct {3}? is certainly not accepted by many of the libraries I tested.

-maarten

--

Maarten L. Hekkelman Cataloniëstraat 3 6663NJ Lent

http://www.hekkelman.com/

epeisach commented 4 years ago

Missed the "?" part of {3}?. Looks like it works in python - but that is not everything. Let me look at it some more. Luckily in developing this regular expression - I wrote some tests so I can apply in various instances.

epeisach commented 4 years ago

Actually, can you indicate what language you are using to test the regular expression with? It would be useful to add to automated testing.

mhekkel commented 4 years ago

I'm using C++ with various regex libraries, like the one from Boost and the version of GNU C++.

You could test version 9.3 of the stdc++ lib from GNU which is the one I'm currently using.

Op 30-06-2020 om 09:55 schreef Ezra Peisach:

Actually, can you indicate what language you are using to test the regular expression with? It would be useful to add to automated testing.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/wwpdb-dictionaries/mmcif_pdbx/issues/13#issuecomment-651619739, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADNA47D4K6MU5QCLMAZ4ATLRZGLANANCNFSM4OK7ITUQ.

epeisach commented 4 years ago

I still need to handle this. This was a painful regular expression to form. I will handle this week.

wojdyr commented 4 years ago

There are more problems with regular expression. It's kind of unavoidable, because the syntax supported by various libraries differs. If you use these regexes in software one thing to check is the presence of backslash in character classes. I think the intention of /\{} in character classes is that \ is included, but the libraries I tried treat is just as an escape character.

epeisach commented 4 years ago

I now have a test that will check using the original C regex definition, as well as python. The current regex passes python - but not C based one. Our deposition system uses python for regular expression testing. I have the original tests I used when developing this regex - so I have tests.

wojdyr commented 4 years ago

Did you check \? For me it doesn't match in Python. Here is the regex for line:

>>> re.match(r'''[][ \t_(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*''', '\\')
<_sre.SRE_Match object; span=(0, 0), match=''>
epeisach commented 4 years ago

I did not test for functionality of python or C - just that it compiles.

Anyways - 3x4matrices updated in master branch. Tests are included. (will need to " git submodule update --init " then cd tests/regextest; mkdir build; cd build; cmake .. ; make

So that should be resolved.

What you are indicating a possible issue with python regular expressions. I guess it is fortunate that back slashes are not used in general in our deposition system in which backslash could be used and uses python regular expression processing. You need to prepare the regular expression properly.

See: https://docs.python.org/3.8/howto/regex.html#the-backslash-plague

I think this particular issue could be closed - and we can open a separate on use of regular expression in which literal backslash is allowed.

wojdyr commented 4 years ago

It is not an issue with Python regular expressions, but with regular expressions in the mmCIF dictionary. But it's not particularly important, so I think it's fine to just leave it as it is now.

Regarding backslashes, they are used in many places in mmCIF strings, for example:

'O to N distance > 2.5 \%A, < 3.2 \%A'
"O[C@@H](CC(O)=O)\C=C\CCS"
mhekkel commented 4 years ago

I just checked out the new version and my validation code no longer complains about the regular expression. That's good.

But now a lot of item_type fields are missing?

May I suggest you have a look at my cif-validator tool which is part of the cif-tools at https://github.com/PDB-REDO/cif-tools This might be useful in the work you do.

wojdyr commented 4 years ago

The items without _item_type.code have parent items, so they have the same types are their parents.

mhekkel commented 4 years ago

I see... and now that I have a closer look at the file I notice that _pdbx_item_linked_group_list has gone.

Without notice, nor a mention in the list of Changes.

weird.

-maarten

Op 28-09-2020 om 08:47 schreef Marcin Wojdyr:

The items without |_item_type.code| have parent items, so they have the same types are their parents.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/wwpdb-dictionaries/mmcif_pdbx/issues/13#issuecomment-699811085, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADNA47FCV3VTGLOZLBJ6WD3SIAWPJANCNFSM4OK7ITUQ.

-- Maarten L. Hekkelman Cataloniëstraat 3 6663NJ Lent

http://www.hekkelman.com/ +31 24 348 0192

epeisach commented 4 years ago

_pdbx_item_linked_grou and _pdbx_item_linked_group_list was never in this GitHub repository - only from the mmcif.wwpdb.org site. Because of the way we produce our dictionary - using dict-pack - this was a separate set of definitions.

These categories provide a framework to provide parent/child tuples for matching. For instance knowing that chain A and residue 56 are present in parent instead of limiting to chain A in parent and residue 56 in the parent- which is not completely sufficient. (could be B56 present)

That said - it should be here - to give a complete dictionary from one source. I will make this change.

epeisach commented 4 years ago

I believe this is resolved in latest dictionary. Please check.

Ezra

mhekkel commented 4 years ago

Yup, thanks