[Issue] pileup "feature"

vpbrendel commented 2 years ago

The pileup.c code implicitly relies on the @SQ entries in the input SAM/BAM header being in alphabetical order, at least for the TSV output.

In the TSV file, the data values are written in rows that follow the order of the @SQ entries in the header. The row names, however, are always alphabetized.

As a result, names are paired with the wrong data whenever the header entries are not in alphabetical order.

How to reproduce: create a BAM input with header @SQ lines for chr1 and chr2 and corresponding data, and run pileup. Then re-run after changing only the order of the two @SQ lines (chr2 before chr1). The TSV output will then show the chr1 data under the name chr2 and the chr2 data under the name chr1.

zwdzwd commented 2 years ago

Thanks for your report. Can you post this to https://github.com/huishenlab/biscuit? That said, I don't understand your question, since the SAM header fully describes the chromosome names. If you switch the order manually, the output will be displayed differently.

vpbrendel commented 2 years ago

This came up with an existing Bismark-generated BAM input file whose header had sequence names NOT in alphabetical order. The pileup code includes a sorting function that puts the names in alphabetical order, and that is the order shown in the TSV file. However, the data entries stay in the order of the BAM header, so the table pairs names with the wrong data.
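To illustrate the kind of fix I have in mind, here is a minimal sketch, not the actual pileup.c code. It assumes the name and its statistics can be kept in one record, so that sorting by name carries the data along. The struct `seq_stats_t`, its fields `n_cpg` and `mean_beta`, and the function `sort_stats_by_name` are all hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: one per @SQ entry, with the name and its
 * statistics stored together so they cannot drift apart. */
typedef struct {
    const char *name;   /* sequence name taken from the @SQ line */
    long n_cpg;         /* illustrative per-sequence count */
    double mean_beta;   /* illustrative average methylation level */
} seq_stats_t;

/* qsort comparator: order records alphabetically by sequence name. */
static int cmp_seq_stats(const void *a, const void *b) {
    const seq_stats_t *x = (const seq_stats_t *)a;
    const seq_stats_t *y = (const seq_stats_t *)b;
    return strcmp(x->name, y->name);
}

/* Sort whole records by name. Because each record holds both the name
 * and the data, the pairing survives any header order. */
void sort_stats_by_name(seq_stats_t *stats, size_t n) {
    if (n < 2) return;          /* nothing to reorder */
    assert(stats != NULL);
    qsort(stats, n, sizeof *stats, cmp_seq_stats);
}
```

With records filled in header order chr2, chr1, calling `sort_stats_by_name` puts the chr1 row first and it still carries the chr1 values. The alternative would be to leave the names unsorted and print everything in header order.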

zwdzwd commented 2 years ago

Thanks for reporting. I think you are right. Can you confirm that the TSV you mentioned is the methylation average stats TSV, not the methylation calls? If so, I think I know how to fix it (it should only affect the methylation average stats).

vpbrendel commented 2 years ago

Correct. Methylation average stats.

zhou-lab / biscuit

[Issue] pileup "feature" #47