wwpdb-dictionaries / mmcif_pdbx

wwPDB PDBx/mmCIF Dictionary

Deposition, preservation and presentation of map coefficients #11

Open · CV-GPhL opened this issue 4 years ago

CV-GPhL commented 4 years ago

See: PDBx/mmCIF: BUSTER electron density map (deposition, display etc)

The map coefficients (for a multitude of different maps) provided during deposition should not only be preserved within the structure factor files, but probably also be made available to non-expert users - alongside additional maps computed via a standardized and validated procedure.

This might not seem directly related to the PDBx/mmCIF dictionary, but different types of maps are currently put into the FWT/DELFWT items even if they are of very different types, e.g. F(early)-F(late) radiation-damage detection maps from autoPROC and BUSTER. The way data is put into the appropriate mmCIF items is directly related to the way this data can be made accessible to end users beyond a simple file download mechanism.
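
For illustration (invented values, and assuming the usual _refln.pdbx_FWT / _refln.pdbx_PHWT and _refln.pdbx_DELFWT / _refln.pdbx_DELPHWT items), the coefficients for quite different maps all end up in the same pair of columns:

    loop_
    _refln.index_h
    _refln.index_k
    _refln.index_l
    _refln.F_meas_au
    _refln.F_meas_sigma_au
    _refln.pdbx_FWT      # amplitude of the "main" map, whatever it happens to be
    _refln.pdbx_PHWT     # phase of the "main" map
    _refln.pdbx_DELFWT   # amplitude of the "difference" map, whatever it happens to be
    _refln.pdbx_DELPHWT  # phase of the "difference" map
    1 0 4 512.3 10.1 498.7 32.5 14.2 187.0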

epeisach commented 4 years ago

The current dictionary limits the map coefficients that can be represented, but we should do better. We should ensure that the diffraction extension can support this in a distinct manner.

CV-GPhL commented 4 years ago

If the diffraction extension could look at the way MTZ files are organised - a nearly arbitrary number of columns with arbitrary names, but with the column type defining the content (H = Miller index, P = phase, F = amplitude, etc.) - this might simplify things while providing flexibility. See here.

The Clipper way (using a hierarchical description, together with column grouping via a "." notation): see e.g. here.

epeisach commented 4 years ago

The issue is that you are then back to someone reading a textual description of what a column "FMap" might mean, and we lose a machine-readable and interpretable file. There have been many cases in which we have had to "guess" the column used for amplitudes based on type, because refinement packages did not indicate which data they were using. In your example above, if you have two amplitude columns - which one was used for refinement? And yes - we have received files in this manner.

I am hoping we can come up with a solution in which nothing is ambiguous. For instance, say you have "FMap" and "PhaseMap" columns, but the type of dataset indicates that this datablock is an "F(early)-F(late)" map - then you can have arbitrary maps, as long as we register the type.
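
One way to picture that (the category and item names below are purely hypothetical, not existing dictionary items) would be to register the map type once per datablock and tie the named columns to it:

    # hypothetical items, for illustration only
    _pdbx_map_type.id                 1
    _pdbx_map_type.type               'F(early)-F(late) difference map'
    _pdbx_map_type.amplitude_column   'FMap'
    _pdbx_map_type.phase_column       'PhaseMap'

A validator could then check the registered type against a controlled vocabulary, even though the column names themselves stay arbitrary.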

CV-GPhL commented 4 years ago

Machine readability is absolutely crucial, yes.

Maybe all this is already there (or is technically impossible), but I was thinking/fantasizing about something like this:

    loop_
    _refln_items.ordinal
    _refln_items.type
    _refln_items.name
    1 amplitude                                 'Fnative'
    2 sigma                                     'SIGFnative'
    3 fom                                       'FOM'
    4 'map coefficient, amplitude'              '2FOFCWT'
    5 'map coefficient, phase'                  'PH2FOFCWT'
    6 'difference map coefficient, amplitude'   'Fearly-late'
    7 'difference map coefficient, phase'       'Pearly-late'
    8 amplitude                                 'FP'
    9 sigma                                     'SIGFP'

I.e. something that associates a type and an (optional) name with each column in the _refln loop. Then we can specify what was used in refinement:

    ...
    _refine.amplitude_used   8
    _refine.sigma_used       9
    ...

and use a generic reflection/data loop:

    loop_
    _refln.item_1
    _refln.item_2
    _refln.item_3
    ...
    _refln.item_9

(That last bit is probably nonsense and won't work anyway, I guess.)

Anyway, all of that might not be possible, or too big a departure - but it feels as if combining some of the benefits of MTZ files (there is a reason we all keep using them for reflection data - at least for now) with the much tighter internal associations of mmCIF could be useful.

Just some (random) ideas/comments ... nothing urgent ;-)

epeisach commented 4 years ago

Technically, there is an issue here: you limit yourself to _refln.item_X, and we have to define a maximum.

I am not ready to commit to this approach - but one "solution" might be:

    loop_
    _refln_data.h
    _refln_data.k
    _refln_data.l
    _refln_data.values
    "<rows with # items wide>"

This would allow for an "unlimited" number of columns - but we lose the ability to validate the values, as we would need to split them into columns, breaking at spaces, etc.
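
To make the trade-off concrete, a packed row under such a hypothetical _refln_data category might look like this (column order and values invented):

    # hypothetical category, for illustration only
    loop_
    _refln_data.h
    _refln_data.k
    _refln_data.l
    _refln_data.values
    1 0 4 '512.3 10.1 0.87 498.7 32.5 14.2 187.0'

Each value string would have to be split at whitespace and matched against an external column registry before any type or range checking could happen - which is exactly the validation we would be giving up.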