wwpdb-dictionaries / mmcif_pdbx

wwPDB PDBx/mmCIF Dictionary
Creative Commons Zero v1.0 Universal

What is the canonical order of categories in an mmCIF file? #35

Closed samirelanduk closed 3 years ago

samirelanduk commented 3 years ago

The categories/tables seem to come in a set order in mmCIF files (I am referring specifically to those in the Protein Data Bank here). _entry is first, then _audit_conform, etc. Is the actual order of all 200 or so given anywhere? In all the documentation I can find, they are just given in alphabetical order.

I have tried to work this out from the PDB itself, but it's not straightforward as no file contains all categories. In fact I think it is impossible to do it this way because the order is not consistent between versions. In 1twj for example (dict version 5.281), pdbx_struct_assembly comes before pdbx_nonpoly_scheme but in 6qha (dict version 5.305) it is the other way around.
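
To illustrate the kind of comparison described above, here is a minimal Python sketch that lists the categories of an mmCIF file in order of first appearance and reports the category pairs that two files order differently. The function names `category_order` and `conflicting_pairs` are hypothetical, and the line-based scan is an assumption that only holds for files laid out like PDB entries (one data name per line, semicolon-delimited text fields); it is not a full CIF parser.

```python
import re

# A data name such as "_entry.id" at the start of a line; group 1 is the category.
DATA_NAME = re.compile(r"_([^.\s]+)\.")


def category_order(mmcif_text):
    """Return category names in order of first appearance (naive line scan)."""
    seen = []
    in_text_field = False
    for line in mmcif_text.splitlines():
        # A leading ";" opens or closes a multi-line text field.
        if line.startswith(";"):
            in_text_field = not in_text_field
            continue
        if in_text_field:
            continue
        match = DATA_NAME.match(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def conflicting_pairs(order_a, order_b):
    """Return (x, y) pairs where x precedes y in order_a but follows it in order_b."""
    pos_b = {name: i for i, name in enumerate(order_b)}
    shared = [name for name in order_a if name in pos_b]
    return [
        (x, y)
        for i, x in enumerate(shared)
        for y in shared[i + 1:]
        if pos_b[x] > pos_b[y]
    ]
```

Running `conflicting_pairs` on the category orders of 1twj and 6qha should surface pairs such as the `pdbx_struct_assembly` / `pdbx_nonpoly_scheme` one mentioned above.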

If I want to know the order of categories for a specific dict version (the most recent one, say), where can I get that information? Producing mmCIF files that match PDB ones is difficult to do without this information.

mhekkel commented 3 years ago

Hi Sam,

I'm not the expert here, but why do you want to know the order? It is not important to programs consuming mmCIF. The only reason I can see for wanting this order is to visually compare two mmCIF files in a text editor.

If that is what you're looking for, you might want to have a look at the program cif-diff, part of our cif-tools (here on GitHub at PDB-REDO/cif-tools). It reads in two mmCIF files, puts the categories of the second in the same order as the first and then compares the content. This can even be done with vimdiff (if you specify --text).

Sorry for the advertisement.

regards, -maarten

-- Maarten L. Hekkelman http://www.hekkelman.com/

samirelanduk commented 3 years ago

I need to produce files that match the public PDB files as closely as possible.

epeisach commented 3 years ago

There is an order for many of the categories, but not all of them. We tend to put key categories early and larger ones later, but that is only to aid manual viewing of the file. A CIF parser has to read the entire file anyway. It is not clear why you need a file that matches public PDB files.

Here is the list of categories that are placed early in the file. It comes from the cifexch2 program in the dictionary software.

"entry", "audit", "audit_conform", "database", "database_2",
"database_PDB_rev", "database_PDB_rev_record",
"pdbx_database_PDB_obs_spr", "pdbx_database_related",
"pdbx_database_status", "pdbx_database_proc", "audit_contact_author",
"audit_author", "citation", "citation_author", "citation_editor", "cell",
"symmetry", "entity", "entity_keywords", "entity_name_com",
"entity_name_sys", "entity_poly", "entity_poly_seq", "entity_src_gen",
"entity_src_nat", "pdbx_entity_src_syn", "entity_link", "struct_ref",
"struct_ref_seq", "struct_ref_seq_dif", "chem_comp", "pdbx_nmr_exptl",
"pdbx_nmr_exptl_sample_conditions", "pdbx_nmr_sample_details",
"pdbx_nmr_spectrometer", "pdbx_nmr_refine", "pdbx_nmr_details",
"pdbx_nmr_ensemble", "pdbx_nmr_representative", "pdbx_nmr_software",
"exptl", "exptl_crystal", "exptl_crystal_grow",
"exptl_crystal_grow_comp", "diffrn", "diffrn_detector",
"diffrn_radiation", "diffrn_radiation_wavelength", "diffrn_source",
"reflns", "reflns_shell", "computing", "refine", "refine_analyze",
"refine_hist", "refine_ls_restr", "refine_ls_restr_ncs",
"refine_ls_shell", "pdbx_refine", "pdbx_xplor_file", "struct_ncs_oper",
"struct_ncs_dom", "struct_ncs_dom_lim", "struct_ncs_ens",
"struct_ncs_ens_gen", "struct", "struct_keywords", "struct_asym",
"struct_biol", "struct_biol_gen", "struct_biol_view", "struct_conf",
"struct_conf_type", "struct_conn", "struct_conn_type",
"struct_mon_prot_cis", "struct_sheet", "struct_sheet_order",
"struct_sheet_range", "struct_sheet_hbond", "pdbx_struct_sheet_hbond",
"struct_site", "struct_site_gen", "database_PDB_matrix", "atom_sites",
"atom_sites_alt", "atom_sites_footnote", "atom_type", "atom_site",
"atom_site_anisotrop", "database_PDB_caveat", "database_PDB_remark",
"pdbx_poly_seq_scheme", ""
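
For anyone wanting to apply a list like this when writing files, here is a minimal Python sketch. It is not how cifexch2 does it: the names `LEADING_CATEGORIES` and `order_categories` are hypothetical, only the first few entries of the list are copied in, and placing unlisted categories after the listed ones in their original order is an assumption.

```python
# First few entries of the list above; extend with the remaining names as needed.
LEADING_CATEGORIES = ["entry", "audit", "audit_conform", "database", "database_2"]


def order_categories(present, leading=LEADING_CATEGORIES):
    """Sort category names so listed ones come first, in list order."""
    rank = {name: i for i, name in enumerate(leading)}
    # sorted() is stable, so categories missing from the list share one rank
    # and keep their input order after the listed ones (an assumption).
    return sorted(present, key=lambda name: rank.get(name, len(leading)))
```

For example, `order_categories(["atom_site", "database_2", "cell", "entry"])` moves `entry` and `database_2` to the front and leaves `atom_site` and `cell` in their original order, because neither is in the shortened list used here.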
samirelanduk commented 3 years ago

Thanks!