welch-lab / liger

R package for integrating and analyzing multiple single-cell datasets
GNU General Public License v3.0
381 stars 78 forks source link

Considering duplicated reads in scATAC tutorial? #227

Closed frankligy closed 3 years ago

frankligy commented 3 years ago

Hi,

Thanks a lot for the great tool and nice vignettes, which are super helpful!

I have a question regarding your RNA+ATAC tutorial (http://htmlpreview.github.io/?https://github.com/welch-lab/liger/blob/master/vignettes/Integrating_scRNA_and_scATAC_data.html). When calculating fragments counts overlapping with gene body and promoter region, bedmap was utilized and all cell barcodes that fall into gene or promoter region will be listed out in the output.

However, I noticed that, the last column in the fragments.tsv file, which indicates how many duplicated reads in the cell that were observed for each fragment, seems to not be used in the counting process, meaning to say even if there are 10 reads that support the fragment, it will only be counted once, which was a bit confusing to me.

# a example fragments.tsv file
chr1 | 10072 | 10198 | ACCAAGTTCGTTTCCA-1 | 1
chr1 | 10072 | 10260 | GGCATGGAGGTGTTAC-1 | 1
chr1 | 10073 | 10216 | AGTCAGGCATTGACAT-1 | 1
chr1 | 10073 | 10400 | AAACCGGCAAGGAATC-1 | 1
chr1 | 10084 | 10346 | TGATCGAGTCTAACAG-1 | 1
chr1 | 10085 | 10198 | GCAATCTAGCATGACT-1 | 1
chr1 | 10085 | 10261 | GAGGTGAGTTTGGGTA-1 | 1
chr1 | 10085 | 10284 | CCTCCTCTCTAAGGAG-1 | 1
chr1 | 10085 | 10302 | CGCAAATTCGCATTAA-1 | 1
chr1 | 10085 | 10308 | CCTGTATGTTTACTTG-1 | 1
chr1 | 10085 | 10332 | CTAGTTGCATCTTGAG-1 | 1
chr1 | 10085 | 10334 | ACATTGCAGCGGGCAA-1 | 2
chr1 | 10085 | 10338 | ACTAACCAGACAGGTA-1 | 1

Shouldn't we use both echo-map-id and echo-map-score to also report the number reads associated with each fragment? (below is a excerpt from bedmap tutorial)

Thanks, Frank

To demonstrate their use, we revisit the Motifs dataset, which includes p-values reporting the statistical significance of putative transcription factor binding sites:

chr1    4534161 4534177 -V_GRE_C        4.20586e-06     -       CGTACACACAGTTCTT
chr1    4534192.4.394205 -V_STAT_Q6      2.21622e-06     -       AGCACTTCTGGGA
chr1    4534209 4534223 +V_HNF4_Q6_01   6.93604e-06     +       GGACCAGAGTCCAC
chr1    4962522.4.392540 -V_GCNF_01      9.4497e-06      -       CCCAAGGTCAAGATAAAG
chr1    4962529 4962539 +V_NUR77_Q5     8.43564e-06     +       TTGACCTTGG
...
Let’s say we want a list of motifs and associated p-values mapped to a coordinate range of interest (chr1:4534150-4534300). In order to conserve space, however, we only want two significant figures for the score data. So we use --prec 2 to try to reformat the score output:

$ echo -e "chr1\t4534150\t4534300\tref-1" \
    | bedmap --prec 2 --echo --echo-map-id --echo-map-score - motifs.bed \
    > motifsForRef1.bed
Here is the output:

chr1    4534150 4534300 ref-1|-V_GRE_C;-V_STAT_Q6;+V_HNF4_Q6_01|0.00;0.00;0.00
jw156605 commented 3 years ago

Hi Frank,

This reflects the previous version of the 10X Cellranger pipeline, which only allows counting one read (across all cells) from a particular position in the genome. It’s an aggressive filtering strategy to avoid PCR duplicates. In the latest version of Cellranger, 10X has relaxed this to allow reads that have unique (start, end, barcode) combinations. We may update our pipeline to reflect this change.

frankligy commented 3 years ago

Thanks a lot for the reply!

Best, Frank