yuntianf / LongcellPre

A pipeline for Nanopore single cell isoform quantification in R
MIT License

Definition of iso_count.txt terms #2

Closed (Theo-Nelson closed 3 months ago)

Theo-Nelson commented 6 months ago

Dear LongcellPre developers,

Thank you for your pipeline. I was curious whether you have published, or could provide, definitions of how each of the individual output columns relates to your mapping algorithm. The program outputs size, cluster, count, and polyA, which seem to be overlapping categories of increasing 'expansiveness.' It also seems that there can be fractional counts, while the size and cluster values seem to be integers.

Thank you very much!

Sincerely, Theo

yuntianf commented 6 months ago

Hi Theo, sorry for the confusion. The size column is the raw count for that read, cluster is the UMI count after UMI clustering, and count is the final UMI count after filtering out scattered UMIs. So count is the final UMI count you should use. I keep size and cluster for diagnostics and will remove those two columns later.

The polyA column indicates whether that read has a polyA tail. Because each read in the output is collapsed from a UMI cluster containing multiple reads, the polyA value is the average over those reads. In downstream analysis I use 0.5 as the threshold to decide whether a read has polyA.

Thanks for the reminder. I will also add the explanation above to the GitHub README page.
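For readers who want to apply these definitions, here is a minimal sketch of using count as the quantification value and thresholding polyA at 0.5. It is not part of LongcellPre. Only the column names size, cluster, count, and polyA come from this thread. The tab delimiter, the `isoform` column, the sample values, and the choice of `>=` rather than `>` at the threshold are assumptions made for illustration.

```python
import csv
import io

# Hypothetical excerpt of iso_count.txt. Only the column names size, cluster,
# count and polyA come from the thread; the tab delimiter, the "isoform"
# column and every value below are made up for illustration.
SAMPLE = (
    "isoform\tsize\tcluster\tcount\tpolyA\n"
    "iso_A\t12\t4\t3.5\t0.75\n"
    "iso_B\t5\t2\t2\t0.5\n"
    "iso_C\t3\t1\t1\t0.0\n"
)

POLYA_THRESHOLD = 0.5  # threshold mentioned in the reply above


def load_rows(text):
    """Parse the table and convert the numeric columns."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):
        rows.append({
            "isoform": rec["isoform"],
            "size": int(rec["size"]),        # raw count, kept for diagnostics
            "cluster": int(rec["cluster"]),  # UMI count after UMI clustering
            "count": float(rec["count"]),    # final UMI count, may be fractional
            "polyA": float(rec["polyA"]),    # average polyA presence in the cluster
        })
    return rows


def has_polya(row, threshold=POLYA_THRESHOLD):
    """Return True if the averaged polyA value reaches the threshold.

    Assumption: the thread does not say whether the comparison is strict,
    so an inclusive comparison (>=) is used here.
    """
    return row["polyA"] >= threshold


rows = load_rows(SAMPLE)

# Use count (not size or cluster) as the quantification value.
total_umi = sum(r["count"] for r in rows)
polya_umi = sum(r["count"] for r in rows if has_polya(r))

print(f"total UMI count: {total_umi}")
print(f"UMI count with polyA: {polya_umi}")
```

To run this on a real output file, `load_rows` could be given the file's contents, after adjusting the delimiter and the non-numeric column names to match what the pipeline actually writes.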