ydottie / hlaproject

Rigorous benchmarking of HLA callers for RNA-seq data
MIT License

About T1K's analysis #3

Closed · mourisl closed this issue 8 months ago

mourisl commented 1 year ago

Thank you for conducting this comprehensive benchmark that compares so many different HLA genotypers 👍.

I was reading your bioRxiv manuscript and noticed that you included T1K in your evaluations. As the developer of T1K, I couldn't help digging into this accuracy notebook a bit. It seems T1K has missing entries in your standardized output format in the results/standard/ folder. This is likely because T1K reports an allele only once in the homozygous case. Since one of the main motivations for T1K is genotyping KIRs, and many KIR genes show up on only one of the chromosomes, we did not distinguish the homozygous case from the single-copy case. After filling each empty entry with the allele from the previous column to fix the homozygous case, T1K's results seem to improve decently, as shown in the "T1K_fix" row.

[image: accuracy table with the added "T1K_fix" row]

It might be worth checking whether other tools have the same formatting issue.

The analysis also showed that the 4-digit accuracy was quite low for T1K, so I'm wondering whether you included the "--preset hla" option when running T1K? Could you share your T1K command? The default parameters for T1K use fairly relaxed alignment criteria and are suited for KIR genes or other genes with less comprehensive databases. In our own test data sets, we also observed many rare 4-digit alleles showing up under the default parameters. Therefore, for HLA genotyping, the "--preset hla" option is needed for better accuracy and computational efficiency.

Thank you again for including T1K in the analysis. I'm looking forward to your reply and further analyses.

P.S.: in the 4-digit accuracy horizontal barplot in accuracy.ipynb, you may need to put T1K at the end of the index to match the order of fourdig_I/_II.

mourisl commented 1 year ago

Dear @ydottie, I just want to follow up on this issue. Have you had a chance to review it, especially the running command? Thank you.

ydottie commented 1 year ago

Hi @mourisl, thank you for bringing this to my attention. I appreciate your support for our project and understand your concerns! I will forward this to the member of our team working on T1K. We'll follow up with you once we implement these suggestions.

likhitha99 commented 1 year ago

Hello @mourisl
Thanks for the suggestions, I'll look into this. I did use the --preset hla option; this is the command I used to generate the results: "run-t1k -1 $input1 $input2 --preset hla -f hlaidx/hlaidx_rna_seq.fa --od $data_folder"

mourisl commented 1 year ago

Hi @likhitha99 , thank you for sharing the command. Did you put "-2" before the $input2 file? Otherwise, T1K will regard the data as single-end.
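
For reference, the paired-end form should look like "run-t1k -1 $input1 -2 $input2 --preset hla -f hlaidx/hlaidx_rna_seq.fa --od $data_folder", i.e. the same command as above with "-2" added before the second mate file.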

likhitha99 commented 1 year ago

Thanks @mourisl for bringing this to my notice, I'll correct it and rerun the script!

mourisl commented 11 months ago

Hi @likhitha99 @ydottie, thank you for rerunning and uploading the new T1K results. I have also reconducted the accuracy analysis in the accuracy.ipynb notebook, filling in the second allele prediction when a tool outputs only one allele for a homozygous genotype. The new code below is inserted into the block that loads the results data in https://github.com/mourisl/hlaproject/blob/main/notebooks/accuracy.ipynb

        # For each class I/II gene, if only the first allele was reported
        # (the ".1" column is empty), copy that allele into the second
        # column so a homozygous call is counted as two identical alleles.
        for gene in ["A", "B", "C", "DQB1", "DRB1"]:
            target = gene + ".1"
            if (target not in pre.columns):
                continue
            pre[target].fillna(pre[gene], inplace=True)

It seems the T1K, HLA-HD, and HLAminer evaluations were all affected by reporting a homozygous allele only once. The final figure suggests that T1K might be the top performer now? HLA-HD's accuracy also improved a lot compared to the notebook on your repository.

[image: updated accuracy figure after filling in homozygous alleles]

ydottie commented 8 months ago

Hi @mourisl -- thank you so much for bringing this to our attention. I have added your code block to the script. We'll update the figures and the preprint as soon as possible. Let me know if there are any other issues that you notice.

mourisl commented 8 months ago

I noticed that the numbers of alleles returned from compute_matches differed across methods on the same dataset. It seems this is because the script ignores the count when a gene is absent from the prediction in the compute_matches function. This may introduce strong biases in some analyses. For example, in the plot for d4 in the accuracy_per_dataset.ipynb analysis, arcasHLA made 4 calls, while the other methods tried hard and called more; as a result, arcasHLA got the highest accuracy/precision. It's likely that d4 is a shallowly covered data set, so not all samples contain enough HLA reads and sensitivity matters. It might be fairer to include those missing predictions in the script, perhaps as two mismatches.
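
Something along these lines could work (a minimal sketch; `truth` and `pred` are illustrative dictionaries mapping gene to allele calls, not the actual variables in compute_matches):

    def count_matches_with_missing(truth, pred):
        # truth / pred: dicts mapping gene -> list of called alleles.
        # A gene missing from pred contributes 0 matches but its truth
        # alleles still enter the denominator, i.e. two mismatches.
        matches, total = 0, 0
        for gene, true_alleles in truth.items():
            total += len(true_alleles)
            remaining = list(pred.get(gene, []))
            for allele in true_alleles:
                if allele in remaining:
                    matches += 1
                    remaining.remove(allele)
        return matches, total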

One very minor comment: for T1K, we usually suggest that users filter out allele calls with a quality score <= 0 in the output, and it seems those were included in your analysis. This should be fine, though; it's worth letting readers know T1K's results without any filters. I just wanted to mention it, so you can disregard this part.
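
In case it is useful, a rough pandas sketch of that filter is below; the file name and the column layout here are assumptions for illustration rather than the exact T1K output specification:

    import pandas as pd

    # Assumed layout of the T1K genotype TSV (one row per gene):
    # gene, num_alleles, allele1, abundance1, quality1, allele2, abundance2, quality2
    cols = ["gene", "num_alleles", "allele1", "abundance1", "quality1",
            "allele2", "abundance2", "quality2"]
    geno = pd.read_csv("t1k_genotype.tsv", sep="\t", header=None, names=cols)

    # Mask allele calls whose quality score is <= 0.
    for allele_col, qual_col in [("allele1", "quality1"), ("allele2", "quality2")]:
        qual = pd.to_numeric(geno[qual_col], errors="coerce")
        geno.loc[qual <= 0, allele_col] = None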

ydottie commented 8 months ago

@mourisl Thank you! We are very grateful for your comments and time to help us improve the accuracy script.

In regards to your first comment, you are right that our code currently excludes all alleles for which a prediction is not made. You bring up a very good point regarding bias; however, I need to think a little more about how we will treat those empty calls -- to me they are neither fully accurate nor fully inaccurate. I'll get back to you on that one.

In regards to your second comment, noted, we'll add a note to the manuscript.

ydottie commented 8 months ago

For now, I modified the accuracy_per_dataset.ipynb notebook to take into account a separate category, "no call", which includes all alleles for which no call is made. For instance, arcasHLA now has a much lower accuracy in dataset 4, because the vast majority of its alleles have been re-categorized as "no call". My hope is that this will mitigate the bias issue that you thoughtfully raised. Thank you again for your feedback!
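
Roughly, each truth allele is now binned along these lines (an illustrative sketch, not the exact notebook code):

    def categorize_allele(true_allele, predicted_alleles):
        # predicted_alleles: alleles the tool reported for this gene
        # (empty when the tool made no call at all).
        if not predicted_alleles:
            return "no call"
        return "match" if true_allele in predicted_alleles else "mismatch"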

mourisl commented 8 months ago

Thank you for making the updates so quickly! I agree that empty calls can be ambiguous to process. Your strategy of explicitly making "no call" a category is an elegant way to visualize this issue.

Either way, including the missing genes or not does not change the overall trend and conclusions of this excellent study.