zavolanlab / PAQR_KAPAC

scripts, pipelines and documentation to run PAQR and KAPAC; KAPAC allows to infer regulatory sequence motifs implicated in 3’ end processing changes; PAQR enables the quantification of poly(A) site usage from standard RNA-seq data
GNU General Public License v2.0
8 stars 4 forks source link

Errors when running PAQR step2 #19

Open Sylarair opened 4 years ago

Sylarair commented 4 years ago

Hi,

Thanks for developing PAQR_KAPAC. I am facing a problem when running step2 of PAQR with the source code (error shown in the attached figure).

Does anyone have idea from which the error comes? Many Thanks!

Screen Shot 2020-09-05 at 7 24 12 PM
Sylarair commented 4 years ago

By the way, The data I used in the code is from https://www.encodeproject.org/experiments/ENCSR000CWC/ (ENCFF391BFB.bam, ENCFF983OXY.bam). Also, the config.yaml setting is shown in the attached figure. In addition, I have put the problem into the github issue. Finally, the attached text file is the reference used in my config.yaml.

Screen Shot 2020-09-05 at 7 34 39 PM

clusters.mm10.vM22.no_tsl.atlas2.canonical_chr.tandem.noOverlap_strand_specific.txt

full_transcripts.mm10.vM22.no_tsl.atlas2.canonical_chr.tandem.noOverlap_strand_specific.txt

koljaLanger commented 4 years ago

Hi

can you please post the output of cat HNRPC_KD/no_bias_samples.out. One possible reason would be that the samples have biased coverages and are excluded from processing. In this case, we highly recommend to do some sanity checks, e.g. visually inspect the coverage in a genome browser to ensure that the sequencing data has trustworthy quality.

If this is really the reason for the error, you could mitigate it by lowering the "bias.median.cutoff" parameter in the config file.

Rgds Ralf

Sylarair commented 4 years ago

Thanks for your reply. The file is empty.

Screen Shot 2020-09-05 at 8 26 10 PM
Sylarair commented 4 years ago

Also, when I tried to rerun, it shows such logs. What should I do with this? Thanks!

Screen Shot 2020-09-05 at 8 28 19 PM
Sylarair commented 4 years ago

Does this mean there is no valid sample? Thanks.

Screen Shot 2020-09-05 at 8 36 52 PM
Sylarair commented 4 years ago

Meanwhile, the two samples median TIN is 76 and 79. It seems ok (default bias.median.cutoff = 70).

Screen Shot 2020-09-05 at 8 38 10 PM
koljaLanger commented 4 years ago

Ok, I think a know the root cause: We designed PAQR in a way that it provides the input for KAPAC which is why PAQR always works with control-treatment sample pairs. In your case, both samples are labeled as "HNRNPC" in the config and so no valid pair of samples is inferred in the first part.

An easy workaround would be to adjust your config file as follows:

ctl_rep1: {bam: ENCFF391BFB, type: CNTRL}
HNRNPC_rep1: {bam: ENCFF983OXY, type: KD, control: ctl_rep1}

In this setting, running PAQR would result in the estimation of poly(A) site usage for both samples independently.

Hope this helps! Ralf

Sylarair commented 4 years ago

Oh, thanks so much! Actually, The two bams are bio-replicates. Would this setting not affect results?

koljaLanger commented 4 years ago

It affects the results slightly, because PAQR would normally take advantage of the two samples being replicates. But it should not be an issue.

Please bear in mind, though, that running KAPAC requires the comparison across conditions - in your case you would need a reference condition to compare against.

Sylarair commented 4 years ago

Got it, thanks!

Sylarair commented 4 years ago

Thanks for your suggestions because I get the PAQR results successfully.

However, facing so many results, I am confused about the difference of 3 files (relative_usages.filtered.tsv, relative_usages.tsv, and relative_usages.relPos_per_pA.out) and wonder which one is the proximal polyA site usage. Could you please explain it to me for I have not found their interpretation in the manual. Many thanks!

Qin

Screen Shot 2020-09-06 at 11 10 44 PM
koljaLanger commented 4 years ago

Hi Qin

there are two files which are worth looking into if you want to obtain the proximal poly(A) site usage:

relative_usages.filtered.tsv tandem_pas_expressions.rpm.filtered.tsv

Both files contain one poly(A) site per line and poly(A) sites are grouped by exons and sorted proximal to distal -> so each line that has a new entry in the exon column indicates a proximal poly(A) site. The last n columns (n = number of samples; so in your case, the last 2 columns) show the relative usage of the the poly(A) site (in the case of relative_usages.filtered.tsv) or the library size normalized expression (in the case of tandem_pas_expressions.rpm.filtered.tsv).

Hope that helps

Best Ralf

Sylarair commented 4 years ago

Thanks for your answer which clears my doubts. There is another 2 question which interest me are: 1. what is used to filter 'relative_usages.tsv' to get 'relative_usages.filtered.tsv'? 2. When applying PAQR to analyze forebrain E13.5 day polyA RNA-Seq data, I find that relative usage of proximal polyA sites in most of genes in relative_usages.filtered.tsv are very high (70~90). This is wired and let me wonder whether I am wrong in some step. Could you please give some comments? Thanks!

PAQR_PPAU.pdf

koljaLanger commented 4 years ago

The filtering simply selects exons with multiple poly(A) sites which have read support in all samples - so this can be considered more of a data cleaning step. Regarding your proximal-distal usage distribution: it is hard to identify a particular reason for your results without further details. In general, cell from the brain have extended 3' ends - but I don't know if this still holds for embryo samples. A good starting point would be to get some global statistics: how many sites were quantified? How does the expression distribution looks like? Do you have an expected distribution?