yala / Mirai

This repository was used to develop Mirai, the risk model described in: Towards Robust Mammography-Based Models for Breast Cancer Risk.
MIT License

Risk Factor Data Format #12

Closed rictoo closed 1 month ago

rictoo commented 1 month ago

Thanks for the excellent project!

How would one provide risk factor information to the model, rather than having the model infer it from the image? I could not find documentation on your GitHub page that explains how to do this (perhaps I'm missing something).

I am running main.py with the following arguments:

python scripts/main.py --model_name mirai_full --img_encoder_snapshot ~/scratch/dadams/mirai/snapshots/mgh_mammo_MIRAI_Base_May20_2019.p --transformer_snapshot ~/scratch/dadams/mirai/snapshots/mgh_mammo_cancer_MIRAI_Transformer_Jan13_2020.p --callibrator_snapshot ~/scratch/dadams/mirai/snapshots/callibrators/MIRAI_FULL_PRED_RF.callibrator.p --batch_size 4 --dataset csv_mammo_risk_all_full_future **--use_risk_factors --use_pred_risk_factors_if_unk --risk_factor_metadata_path ~/scratch/dadams/mirai/rf_metadata.json** --metadata_path ~/scratch/dadams/mirai/metadata_subset.csv --test --prediction_save_path ~/scratch/dadams/mirai/genbcpred_output_rftest.csv

I wasn't able to find a sample risk factor metadata file, so I tried to infer its structure from the Mirai source code (seemingly unsuccessfully) as:

{
  "ID1234": {
    "binary_family_history": 1,
    "accessions": {
      "0": {
        "age": 45
      }
    }
  }
}

which corresponds to this in metadata_subset.csv:

"ID1234",0,"R","CC","/home/dadams/scratch/dadams/mirai/png_outputs/ID1234_05_anon.dcm.png",2,2,"test"
"ID1234",0,"L","CC","/home/dadams/scratch/dadams/mirai/png_outputs/ID1234_06_anon.dcm.png",2,2,"test"
"ID1234",0,"R","MLO","/home/dadams/scratch/dadams/mirai/png_outputs/ID1234_07_anon.dcm.png",2,2,"test"
"ID1234",0,"L","MLO","/home/dadams/scratch/dadams/mirai/png_outputs/ID1234_08_anon.dcm.png",2,2,"test"

However, when I run main.py with the aforementioned arguments, I get this error message:

Traceback (most recent call last):
  File "/home/dadams/temp/mirai/Mirai/onconet/utils/risk_factors.py", line 372, in parse_risk_factors
    metadata_json = json.load(open(args.metadata_path, 'r'))
  File "/home/dadams/scratch/dadams/mirai/.../lib/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/home/dadams/scratch/dadams/mirai/.../lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/home/dadams/scratch/dadams/mirai/.../lib/python3.6/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 13 (char 12)

...
Exception: Not found /home/dadams/scratch/dadams/mirai/metadata_subset.csv Extra data: line 1 column 13 (char 12)

Oddly, when I run main.py with the risk factor-related arguments, it suddenly seems to try parsing the *.csv file as a JSON file. When I exclude the risk factor-related arguments, the whole pipeline runs successfully.
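The traceback shows parse_risk_factors calling json.load on args.metadata_path, which here is the CSV path, so an "Extra data" error is what one would expect. Below is a minimal sketch of that failure mode, assuming a made-up row shaped like the metadata rows above. The helper name try_parse_as_json and the image path are hypothetical. The reported offset depends on the length of the first quoted field, so it will not necessarily match the offset in the traceback.

```python
import json

# Hypothetical CSV row in the same shape as the metadata rows in this thread.
csv_row = '"ID1234",0,"R","CC","/path/to/image.png",2,2,"test"'


def try_parse_as_json(text):
    """Return (True, value) on success, or (False, (message, offset)) on failure."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as err:
        # err.msg is the short reason; err.pos is the character offset
        # at which the decoder gave up.
        return False, (err.msg, err.pos)


ok, detail = try_parse_as_json(csv_row)
print(ok, detail)
```

The decoder reads the first quoted field as a complete JSON string and then complains about everything after it, which is why the error points just past the first field rather than at the start of the file.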

Could anyone be so kind as to help format and provide risk factor data to Mirai successfully? :)

Thank you!

yala commented 1 month ago

Our original risk factor pipeline was tightly integrated with the internals of MGH's radiology database (Magview SQL), and that full integration isn't part of our code release since it's specific to MGH.

The CSV-based dataset class does not currently support risk factors, but you could augment it to do so in a fork. The MGH dataset, in datasets, shows a reference for how we did this, but the logic is a little complicated. It might be easier to do this from first principles: load your JSON and try to fit it into the RiskFactorVectorizer.
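As a rough illustration of the "first principles" route, the sketch below loads a risk factor JSON and pairs it with CSV rows by patient and exam ID. It is a sketch under several assumptions, not the Mirai implementation. The JSON layout is the one guessed earlier in this thread and is not confirmed by the Mirai code. The CSV is read positionally, with the first two fields taken to be the patient ID and exam ID. The helpers load_risk_factors and attach_risk_factors are hypothetical. The final step of passing the result to RiskFactorVectorizer is left out because its expected input is not described here.

```python
import csv
import io
import json

# Assumed JSON layout (guessed in this thread, not confirmed by Mirai).
rf_json = """
{
  "ID1234": {
    "binary_family_history": 1,
    "accessions": {"0": {"age": 45}}
  }
}
"""

# Two hypothetical metadata rows; field 0 is taken as the patient ID
# and field 1 as the exam ID.
metadata_csv = (
    '"ID1234",0,"R","CC","/path/to/a.png",2,2,"test"\n'
    '"ID9999",0,"L","CC","/path/to/b.png",2,2,"test"\n'
)


def load_risk_factors(json_text):
    """Flatten the nested JSON into {(patient_id, exam_id): risk factor dict}."""
    nested = json.loads(json_text)
    flat = {}
    for patient_id, patient in nested.items():
        # Patient-level factors apply to every exam of that patient.
        patient_level = {k: v for k, v in patient.items() if k != "accessions"}
        for exam_id, exam_level in patient.get("accessions", {}).items():
            flat[(patient_id, str(exam_id))] = {**patient_level, **exam_level}
    return flat


def attach_risk_factors(csv_text, flat):
    """Pair each CSV row with its risk factors, or None if there are none."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        key = (row[0], row[1])
        rows.append(
            {"patient_id": row[0], "exam_id": row[1], "risk_factors": flat.get(key)}
        )
    return rows


flat = load_risk_factors(rf_json)
rows = attach_risk_factors(metadata_csv, flat)
for r in rows:
    print(r)
```

Rows without a matching entry get None, which leaves room to fall back on predicted risk factors for those exams in whatever way the forked dataset class chooses.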

In general, since the performance difference with and without risk factors is marginal, we only support image-only versions for deployments, as they are much easier to manage.