wwpdb-dictionaries / mmcif_pdbx

wwPDB PDBx/mmCIF Dictionary
Creative Commons Zero v1.0 Universal
9 stars 9 forks source link

Boundary conditions - e.g. per-shell R and CC values #29

Open CV-GPhL opened 3 years ago

CV-GPhL commented 3 years ago

As far as I understand, the following boundary conditions are given in the current dictionary

Apart from the fact that there seems an inconsistency regarding the lower limit (inclusive/exclusive), where do these boundary conditions come from? They are not intrinsic in the formulae for computing these metrics: a poorly populated shell or one with a lot of negative intensity values (high anisotropy of significant data while computing these metrics using all data before data cut-off) could easily lead to perfectly valid negative values.

Another example is

which seems rather odd. A correlation coefficient should have a boundary condition of -1 <= x <= 1 (intrinsic to the way it is computed).

Maybe there is some confusion about the use of "Boundary condition" and the "Advisory Boundary condition". I would have thought that the "Boundary condition" describes what range of values a particular item can take due to the formula (see CC above) or the nature of the item (a "number of reflections" can not be negative etc). As a complement to this, the "Advisory Boundary condition" can provide smaller boundaries in order to provide more guidance what is commonly seen as a sensible range (for whatever reason). So we have for example

Questions:

  1. Should "Boundary Conditions" describe the whole possible range of values for a given item (without much judgement about what a "sensible" value should be)?
  2. Where do the "Advisory Boundary condition" come from? Are they based on some article/paper - or is there another document that describe the thinking behind them?
  3. Is there maybe a bit of a mix-up between the different roles of those two types of boundary conditions? Clearly, a correlation coefficient limit of 0.0 is due to some advisory thinking at this point, right?
berrisfordjohn commented 3 years ago

Boundary and advisory limits are used during deposition and annotation to ensure that the values that we receive are valid. Advisory limits are used to display warnings to depositors and annotators during deposition / annotation. Boundary limits are used to limit the values that we receive during deposition - we do not allow values which exceed these values.

Take pH as an example: http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_exptl_crystal_grow.pH.html The boundary limits are the limits of pH and the advisory limits are the values we expect most experiments to fall within.

We are looking at adding a boundary limit to Rmerge to prevent users entering values of 10 instead of 0.10. It has taken the biocurators several months to clean up values which were incorrect. Another example is http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_reflns_shell.percent_possible_obs.html where we recently introduced a limit of 100% to prevent values of greater than 100% coming in, which we were routinely getting several a month.

The OneDep system does parse values from uploaded files - so this problem is less pronounced for categories which are typically in uploaded coordinate files - such as refinement statistics from _refine.

CV-GPhL commented 3 years ago

It makes complete sense to have a "Boundary condition" of 0 <= pH <= 14 because the definition of a pH value does not allow it to be negative or above 14. It also makes sense to have an "Advisory boundary condition" of 3.5 <= pH <= 10.0 to provide some feedback mechanism (e.g. during deposition) for potentially incorrect or at least unusual values.

Going back to reflection data, both upper boundary conditions for _reflns_shell.Rmerge_I_obs make sense. My question concerning per-shell merging R values is why they have a lower "Boundary condition" of 0.0 when there is no reason that they should always be above 0.0. It's as if the pH boundary conditions would be 1.0 <= pH <= 14. The case of _reflns_shell.pdbx_CC_half is even more pronounced.

Anyway, if I understand your pH example correctly then the "Boundary condition" is describing the theoretical limits of a particular value and the "Advisory boundary condition" provides for some sanity checking during deposition. Therefore the above examples should most likely be modified to become

(the lower "Boundary condition" for the R-values is maybe debatable, although I would prefer allowing for a certain negative range). If there is the (unfortunate and unenviable) need to distinguish between values given in % or not at deposition, this should probably be part of the "Advisory boundary condition" (which looks like a PDBx extension, _pdbx_item_range) and maybe not placed into the "Boundary condition" (_item_range) which seems an original mmCIF item.

Life would probably be easier if no values were ever given in percentages (which sometimes gives the impression a value could only ever be between 0 and 100%)... but storing integers was just so much cheaper back in the days I guess ;-)

drlemmus commented 3 years ago

It makes complete sense to have a "Boundary condition" of 0 <= pH <= 14 because the definition of a pH value does not allow it to be negative or above 14. Well actually.... (https://pubs.acs.org/doi/pdf/10.1021/ed083p1465)

Anyway, boundary conditions should be for impossible values. The advisory limits a from a model/data interpretation point of view much more important.

epeisach commented 3 years ago

While some of the limits may seem odd, they are based partially on the data in the archive. For instance Rmerge on I for reflns_shell - we have a value of 116 in 6QT6. This agrees with the publication in which "sensible" Rmerge is reported for other datasets - and this one is radiation induced damaging of the crystal. Knowing the PI - I do not believe this was an error of percentage and off by 100.

Inclusion of zero boundary limits might be in error - but we need to ensure that we do not have zero's hidden away in the archive - instead of ? for missing data - so they need to be checked one by one.

Clearly more cleanup can and will continue into the future.

With regards to advisory limits, these were based on statistical analysis of data in the archive. It is used to suggest unusual values that might be in error in manual data entry - but we did not wish to block someone who does have that unusual case.

dagewa commented 3 years ago

Does the "observed criterion" of _reflns_shell.Rmerge_I_obs ensure that its lower possible limit is not -∞? That is, observed reflections are not negative*, so the denominator cannot be negative and 0.0 is a safe boundary.

*This is making an assumption about valid definitions of the observed criterion. I don't know if such assumptions are safe.

Likewise, _reflns.threshold_expression may do the same for _reflns_shell.Rmerge_I_gt.

I agree that _reflns_shell.Rmerge_I_all should have boundaries -∞ < x < +∞