Nextstrain-compatible JSON subtrees are now missing key: "value" pairs

mdperry commented 1 year ago

Hi, Every week I run this command a couple of times (for well over a year now):

matUtils extract -M public.plusGisaid.latest.metadata.tsv,samplemeta.tsv,user_seqs.translation.tsv -i new_tree.pb -j CDPH -s sample.ids -N 500

The public metadata file has 12.9 million records, the samplemeta file has 560K records, the protobuf tree has nearly 13 million samples. I get out over 81,000 CDPH-subtree-xxxxx.json subtrees, along with a nice table named subtree-assignments.tsv.

Some time in between OCT-28-2022 and NOV-02-2022, the JSON files produced by this command . . . changed. They are now smaller, approximately one half the size. The main difference I have noticed/found is this:

OCT-28 = "BEFORE"
  "tree": {
    "children": [
      {
        "branch_attrs": {
          "labels": {
            "aa_mutations": "ORF6:D30D",
            "leaves_sharing_mutations": "137542",
            "nt_mutations": "T27291C",
            "nuc mutations": "G210T,C241T,C3037T,G4181T,C6402T,C7124T,C8986T,G9053T,C10029T,A11201G,A11332G,C14408T,G15451A,C16466T,A17236G,C19220T,C21618G,T22917G,C22995A,A23403G,C23604G,C24208T,C25469T,T26767C,T27291C,T27638C,C27752T,C27874T,A28461G,G28881T,G28916T,G29402T,G29742T",
            "sample": "node_324504"
          },
          "mutations": {
            "nuc": [
              "G210T",

NOV-02 = "AFTER"
  "tree": {
    "children": [
      {
        "branch_attrs": {
          "labels": {
            "nuc mutations": "G210T,C241T,C3037T,G4181T,C6402T,C7124T,C8986T,G9053T,C10029T,A11201G,A11332G,C14408T,G15451A,C16466T,A17236G,C19220T,C21618G,T22917G,C22995A,A23403G,C23604G,C24208T,C25469T,T26767C,T27291C,T27638C,C27752T,C27874T,A28461G,G28881T,G28916T,G29402T,G29742T",
            "sample": "node_324504"
          },
          "mutations": {
            "nuc": [
              "G210T",

There are three labels missing from the branch_attrs object. There may well be many other errors (and I am sure that there are), of a similar nature, where key: "value" pairs that I need are either missing, or in some different part of the data structure. This is just the first, clear, simple example I discovered when comparing the two files side by side.

Has anyone else noticed this? If this behavior is reproducible then I am curious to know if it was an intentional change to the output subtrees, or an inadvertent change to the code base.

From reviewing the dates of the usher repo's releases, this timing seems to coincide with the release of UShER version 6.0. I should say that whenever I am using usher and/or matUtils in our production environment, I am pulling it from this docker container: https://hub.docker.com/r/pathogengenomics/usher My understanding it that we are using GitHub Actions to update the docker container whenever there is a new release here.

Thanks, -- Marc Perry UCSC Genomics Institute Pathogen Genomics Team

jmcbroome commented 1 year ago

It appears that the missing label attributes are derived from the metadata input files (-M). Perhaps something has changed with those metadata files? Are they being correctly parsed, and are the sample ids matching between the tree and the metadata files? If you remove the -M argument from your command, do you get the same output JSON as you do now?

mdperry commented 1 year ago

All of the input tables, files, and parameters between the two runs were identical; I was running this in a dev environment. In production, runs from different dates would not have identical inputs since Angie's nightly global tree and metadata files are always growing. So the only difference between the two runs is the usher version in the docker images.

I have not tried running any of this in this context without the -M argument, but I will go back and try that in case it is illuminating. Oh, the sample ids are cut out of column 1 of the samplemeta.tsv table, so yes they have to match.

mdperry commented 1 year ago

UPDATE: I have now compared UShER/matUtils v.5.6 vs. v.6.0 using all three metadata files (the original version above), zero metadata files, and each metadata file by itself, so the public_meta, my samplemeta.tsv and the user_seqs.translation.tsv able on their own.

It is quite clear that UShER/matUtils v.6.0 fails to add metadata from the public_meta table, or the user_seqs.translation.tsv table to the 497 samples in the subtree that are not also contained in the samplemeta.tsv table. The presence of a sample in the samplemeta.tsv table, which is also in the sample.ids list, ensures that the full metadata available gets attached to those 3/500 samples. This anomalous behavior is now also affecting/impacting the 81K+ subtrees we produce for the California Big Tree investigator. I am no expert but this seems like a bug to me.

jmcbroome commented 1 year ago

Handling metadata for json output has been messy in the past, and we've previously made some changes around not storing metadata values for nodes that are not part of the output to reduce memory consumption- its possible that your results are due to this. If this diagnosis is correct, the presence of a sample in the samplemeta.tsv table isn't important, but rather you would only be retaining metadata from any table for samples listed in samples.ids.

However, I'm not sure why this behavior would have changed for this specific version update- it should have been happening before you upgraded if this is the explanation. The code for this type of metadata parsing and json production hasn't changed in more than a year. There's nothing in the compare between v0.5.6 and v0.6.0 that I can identify that should be changing this. I welcome insight from the other developers, if they're aware of anything I might have missed.

Using -N output, you should be producing a single "subtree-assignments.tsv" file that contains some sample metadata information. What does this output look like? Has it changed as well between these versions?

mdperry commented 1 year ago

No, the subtree-assignments.tsv file has not changed either. I was confused (and wrong/incorrect) about when the change in the UShER code base was implemented that changed the structure of the output auspice.us-compatible JSON subtrees. It was between UShER versions 0.5.3 (2022-03-12) and version 0.5.4 (2022-04-25). I was able to install these two versions on VMs and replicate the effect. If I understand the output from the git log command, there were a couple of commits on APR-19/APR-20 that modified the matUtils file named select.cpp, to only use information from the metadata files if the samples were specified in the sample.ids file. And this does (I think) describe the effect that I am seeing. Namely, for our production pipeline for CDPH we produce subtrees containing 500 samples, and between 1 and 500 of those samples are in our list of about 590K samples.

What used to happen was that other, non-California samples, in those subtrees, would retain their rich metadata from @AngieHinrichs metadata table that accompanies her nightly build of the global tree. Specifically, the clade and the lineage are in her table, and in matUtils extract from UShER v.0.5.3 (and earlier), that metadata for the non-Californian samples would populate both the branch_attrs { JSON } and the node_attrs { JSON }. However, for matUtils extract versions 0.5.4 (and higher), none of the information in the other metadata tables gets added to the subtrees that are generated (if those samples are not in the list of sample.ids). As far as I can tell, only the country, and the date, which seem to be in the protobuf file itself, get added to the subtree samples for non-Californian samples, and these only get added to the node_attrs { JSON }.

I gather that the clade and lineage information is attached to each node in the MAT, but for reasons that I don't understand, that information is not exported when the auspice-compatible JSON subtrees are generated.

This screenshot shows what one subtree used to look like:

While this screenshot shows what it looks like now:

I assume that the the code change to limit supplementary metadata to only samples on the requested list was implemented for a reason, but as a major user of the subtree extraction feature of matUtils, I was wondering if it would be possible to add this as user-specified flag (as in, Do you really want metadata for subtree samples that are not on your list?).

Thanks

jmcbroome commented 1 year ago

It was implemented because it was extremely memory costly to load a tree with millions of samples worth of metadata for users extracting a small subtree. However, you are very correct that this meshes poorly with our implementation of -N/-K type iterative subtree creation. Adding a flag such as you describe should be simple- I will look into it this week.

kvargha commented 1 year ago

Hi @jmcbroome,

I tested out the --load-all-metadata flag and my VM ran out of memory, as expected. Is it possible to load the metadata only for extracted samples and not the entire tree?

jmcbroome commented 1 year ago

The short answer is no. The longer answer is that doing so would be substantially complex. Query samples are established initially though -s and other options, so loading metadata for just those is straightforward. We don't figure out what context samples we need, however, until we're actually in the functions building the relevant subtrees, long after we choose which metadata to load into memory.

I would recommend doing some preprocessing- if you know all of your query samples are in Omicron, you don't need metadata for Delta samples, because their context is also pretty much going to be Omicron. Subset your table before you load it.

AngieHinrichs commented 1 year ago

We don't figure out what context samples we need, however, until we're actually in the functions building the relevant subtrees, long after we choose which metadata to load into memory.

Could the metadata be loaded after subtrees are built?

jmcbroome commented 1 year ago

It is possible, but it will take substantially more work and testing than just adding the load-all flag. It would necessitate refactoring of the -N and -K options, which arguably should be done at some point anyways given their inconsistencies, but it's not something I'd be able to accomplish immediately like the aforementioned load-all flag. If our users want metadata for context samples in the near future while having limited memory, they will need to work around it by intelligently subsetting their metadata file before loading.

jmcbroome commented 1 year ago

I can look into what it would take in the next few weeks, perhaps as a larger rework/reorganization of matUtils extract, which is generally bloated with inconsistent behavior and implementations across arguments.

yatisht / usher

Nextstrain-compatible JSON subtrees are now missing key: "value" pairs #303