wwpdb-dictionaries / mmcif_pdbx

wwPDB PDBx/mmCIF Dictionary
Creative Commons Zero v1.0 Universal

Category PDBX_DIFFRN_MERGE_STAT_SHELL is isolated (and redundant?) #7

Open pkeller opened 4 years ago

pkeller commented 4 years ago

The category PDBX_DIFFRN_MERGE_STAT_SHELL:

The difference is that one category refers to a resolution shell from a data section, and the other to the whole data section. It seems to me that we don't need both -- resolution limits are needed in both cases anyway.

My initial suggestion:

  1. Add ordinal_id to PDBX_DIFFRN_MERGE_STAT and make it part of the category key
    • its type should be non_negative_int or positive_int rather than code
  2. Make _pdbx_diffrn_merge_stat.d_limit_high and _pdbx_diffrn_merge_stat.d_limit_low mandatory
    • I am sure that we all know better than to produce statistics of this type without specifying the resolution limits.
  3. Remove PDBX_DIFFRN_MERGE_STAT_SHELL.

Row(s) in this category that refer to a whole data section could be identified from the widest resolution range. If there isn't a single resolution range for a data section that encloses all the shells for the same data section, that is no different from the situation where the PDBX_DIFFRN_MERGE_STAT category is absent now, and if any action is needed it should be handled in the same way.
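To make the "widest resolution range" idea concrete, here is a minimal sketch in Python. It assumes the rows of one data section have already been read into plain dicts keyed by item name; the function name `find_overall_row` and the dict layout are illustrative only and not part of any existing tool. Note that d_limit_high is the smaller d-spacing and d_limit_low the larger one.

```python
def find_overall_row(rows):
    """Return the row whose resolution range spans all rows, or None.

    `rows` is a list of dicts for ONE data section (an assumption of this
    sketch), each with numeric "d_limit_low" and "d_limit_high" values.
    """
    if not rows:
        return None
    # Widest range: smallest high-resolution limit, largest low-resolution limit.
    d_high = min(r["d_limit_high"] for r in rows)
    d_low = max(r["d_limit_low"] for r in rows)
    for r in rows:
        # Exact comparison; limits rounded differently would not match.
        if r["d_limit_high"] == d_high and r["d_limit_low"] == d_low:
            return r
    # No single row encloses all the others: treat as "overall row absent".
    return None


rows = [
    {"ordinal_id": 1, "d_limit_low": 50.0, "d_limit_high": 4.0},
    {"ordinal_id": 2, "d_limit_low": 4.0, "d_limit_high": 2.5},
    {"ordinal_id": 3, "d_limit_low": 2.5, "d_limit_high": 2.0},
    {"ordinal_id": 4, "d_limit_low": 50.0, "d_limit_high": 2.0},
]
overall = find_overall_row(rows)
```

With the example rows, the row with ordinal_id 4 is selected; if that row is dropped, no row spans the full range and the function returns None, which corresponds to the "category absent" case described above.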

OTOH, if an explicit indicator for the row(s) in the category that refer to a whole data section is needed (rather than looking for the widest resolution range), this could be done in one of two ways:

  1. a documented convention that a specific value of ordinal_id indicates the whole data section. A suitable value might be one of:
    1. . with _item.mandatory_code no (if this is allowed for a key item - I've been out of this game too long and I forget :wink: ).
    2. 0 with a type of non_negative_int
    3. 1 with a type of positive_int
  2. an additional boolean item

Comments anyone?

CV-GPhL commented 4 years ago

Using "widest resolution range" is bound to create problems (rounding differences, "nice numbers" chosen for the overall values, etc.).

I can see the benefit of using a special ordinal_id value - and would go for the 0 value (to stay within the same type) in that case.

On the other hand, the typical ordering in all tables presented to users is (from the low-resolution end) 1,2,3,...,N and then (N+1) is the overall value.

How about adding an additional item into the PDBX_DIFFRN_MERGE_STAT category: data_coverage (or similar) with pre-defined values of "overall" (or "all" or "total") and "subset"? Then ordinal_id doesn't need to be overloaded with meaning, right?
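As a rough illustration of how a consumer could use such an item, here is a small Python sketch. It assumes the item is called data_coverage with the values "overall" and "subset" (the names floated above, not something in the dictionary today) and that rows are held as plain dicts; `split_by_coverage` is a made-up helper name.

```python
def split_by_coverage(rows):
    """Separate shell rows from overall rows using a hypothetical data_coverage item."""
    overall = [r for r in rows if r["data_coverage"] == "overall"]
    shells = [r for r in rows if r["data_coverage"] == "subset"]
    # Order shells from the low-resolution end (largest d) to the high-resolution end.
    shells.sort(key=lambda r: r["d_limit_low"], reverse=True)
    return shells, overall


rows = [
    {"ordinal_id": 1, "data_coverage": "overall", "d_limit_low": 50.0, "d_limit_high": 2.0},
    {"ordinal_id": 2, "data_coverage": "subset", "d_limit_low": 2.5, "d_limit_high": 2.0},
    {"ordinal_id": 3, "data_coverage": "subset", "d_limit_low": 50.0, "d_limit_high": 4.0},
    {"ordinal_id": 4, "data_coverage": "subset", "d_limit_low": 4.0, "d_limit_high": 2.5},
]
shells, overall = split_by_coverage(rows)
# Table order as usually shown to users: shells 1..N, then the overall row.
table = shells + overall
```

Here ordinal_id carries no meaning beyond being a key: the rows can be stored in any order and the usual presentation order is rebuilt from the resolution limits and data_coverage.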

staraniso_alldata-unique.stats.txt

epeisach commented 4 years ago

This is designed to parallel reflns and reflns_shell statistics. These are designed to be produced by software and we do not expect end users to be typing or cobbling together numbers from different tables in scaling output files - or do we? Let's say five years from now, when we implement this, someone is still using the original scalepack - will they be able to produce a table with overall vs specific shells?

Having separate categories makes it easy to say: you provided a statistic for the high-resolution shell, so what about the overall data, and where do we find it? Personally, though, I can go either way.

CV-GPhL commented 4 years ago

Isn't an overall value identical to a per-shell value? The actual metric (and computation) is the same: it uses the reflections within two given resolution limits. Keeping two different categories for identically defined and computed values seems a slightly unnecessary duplication. Given the inconsistent item naming and definitions within REFLNS and REFLNS_SHELL, I'm not sure that paralleling them is a good enough reason to stick with two categories.

The handling of missing data at deposition (shell value(s) given, but no overall value) seems more closely related to actual deposition software and not necessarily impacting directly on the dictionary definitions. It should be easy for software to see if one of the shell values encompasses all other shells (and there is no overlap between those other ones - once we know if those limits are inclusive or not) and act in the same way as described above, I think.
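A minimal Python sketch of that check, under stated assumptions: rows for one data section are plain dicts, `find_enclosing_row` is a hypothetical helper, and two shells that merely share a boundary value are not counted as overlapping (whether the limits are inclusive is exactly the open question, so this choice is an assumption).

```python
def find_enclosing_row(rows):
    """Return a row that encloses all other rows, provided those others do not overlap."""
    for candidate in rows:
        others = [r for r in rows if r is not candidate]
        if not others:
            continue
        # The candidate must reach at least as far as every other row at both ends.
        encloses = all(
            candidate["d_limit_high"] <= r["d_limit_high"]
            and r["d_limit_low"] <= candidate["d_limit_low"]
            for r in others
        )
        if not encloses:
            continue
        # Sort the remaining shells from low resolution (large d) to high resolution.
        ordered = sorted(others, key=lambda r: r["d_limit_low"], reverse=True)
        # Overlap: the next shell starts at a larger d than where the previous one ends.
        # A shared boundary value is tolerated (assumption, see lead-in).
        overlap = any(
            nxt["d_limit_low"] > prev["d_limit_high"]
            for prev, nxt in zip(ordered, ordered[1:])
        )
        if not overlap:
            return candidate
    return None


clean = [
    {"ordinal_id": 1, "d_limit_low": 50.0, "d_limit_high": 4.0},
    {"ordinal_id": 2, "d_limit_low": 4.0, "d_limit_high": 2.5},
    {"ordinal_id": 3, "d_limit_low": 50.0, "d_limit_high": 2.5},
]
overlapping = [
    {"ordinal_id": 1, "d_limit_low": 50.0, "d_limit_high": 4.0},
    {"ordinal_id": 2, "d_limit_low": 4.5, "d_limit_high": 2.5},
    {"ordinal_id": 3, "d_limit_low": 50.0, "d_limit_high": 2.5},
]
found_clean = find_enclosing_row(clean)
found_overlapping = find_enclosing_row(overlapping)
```

In the first example the row with ordinal_id 3 is identified as the overall row; in the second the shells overlap between 4.0 and 4.5, so nothing is returned and the deposition software would have to treat the overall value as missing.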

epeisach commented 4 years ago

A per-shell row could cover the entire resolution range.

Historically, _reflns_shell was intended for the highest resolution shell. With data harvesting with pdb_extract, parsing log files became easier (and I suspect xia2 does the same). Therefore, ranges are now accepted.

Ordering is a challenge. Do we go from low to high resolution, or from high to low? If you are trying to determine the overall data statistics, would you want them mixed in here?