zavolanlab / PAQR_KAPAC

scripts, pipelines and documentation to run PAQR and KAPAC; KAPAC allows to infer regulatory sequence motifs implicated in 3’ end processing changes; PAQR enables the quantification of poly(A) site usage from standard RNA-seq data
GNU General Public License v2.0

Further question about how to properly set up the sample relationships #14

Open aleighbrown opened 4 years ago

aleighbrown commented 4 years ago

I'm a bit confused about the appropriate way to set up the samples in the config.yaml.

Currently, the config.yaml as provided when you download it looks like this:

#-------------------------------------------------------------------------------
# sample specific values:
# - name of samples per study
# - name of BAM file and condition per sample
#-------------------------------------------------------------------------------

HNRNPC_KD:
  samples: [ctl_rep1, ctl_rep2, HNRNPC_rep1, HNRNPC_rep2]

ctl_rep1: {bam: CTL_rep1, type: CNTRL}
ctl_rep2: {bam: CTL_rep2, type: CNTRL}
HNRNPC_rep1: {bam: KD_rep1, type: KD, control: ctl_rep1}
HNRNPC_rep2: {bam: KD_rep2, type: KD, control: ctl_rep2}

Is HNRNPC_rep1 being directly compared to ctl_rep1? What if my samples don't have such a clear-cut "this control should be compared to this case" relationship, e.g. I've done 3 biological replicates in each condition but they're not what I would call directly matched?

If my samples are MUT1, MUT2, MUT3, WT1, WT2, WT3, would it make a difference in the final analysis if I set up the relationship as

MUT1: {bam: MUT1, type: MUT, control: WT1}
MUT2: {bam: MUT2, type: MUT, control: WT2}

vs

MUT1: {bam: MUT1, type: MUT, control: WT2}
MUT2: {bam: MUT2, type: MUT, control: WT3}

What if my sample sizes for the conditions weren't matched, for example if I have 5 samples in one condition and 8 in another?

Thanks!

koljaLanger commented 4 years ago

PAQR runs condition-wise, so for the inference of poly(A) site usage it makes no difference which sample you specify as the control for the mutant samples. However, the KAPAC step needs a reference sample to compare against, so results may change depending on which of the wild-type samples you use as control. That being said, you do not need matched treatment and control samples.

It would actually be of interest to us if you changed the control samples in two independent runs and got completely different results. We would expect the results to be stable with respect to this type of change.
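One way to prepare two such runs is to generate the sample entries with a rotated control assignment. The following is only a minimal sketch and not part of the pipeline: `assign_controls` and `config_lines` are hypothetical helper names, the sample names are the MUT/WT examples from the question above, and the BAM name is assumed to equal the sample name. Controls are reused cyclically, so unequal group sizes (e.g. 5 vs 8) are covered as well.

```python
def assign_controls(cases, controls, shift=0):
    """Map each case sample to a control, cycling through the controls.

    `shift` rotates the assignment so that a second run uses different pairs.
    """
    if not controls:
        raise ValueError("at least one control sample is required")
    return {
        case: controls[(i + shift) % len(controls)]
        for i, case in enumerate(cases)
    }


def config_lines(mapping, case_type):
    """Render entries in the style of the sample section of config.yaml.

    Assumption: the BAM name is identical to the sample name.
    """
    return [
        f"{case}: {{bam: {case}, type: {case_type}, control: {control}}}"
        for case, control in mapping.items()
    ]


cases = ["MUT1", "MUT2", "MUT3"]
controls = ["WT1", "WT2", "WT3"]

run_a = assign_controls(cases, controls, shift=0)  # MUT1->WT1, MUT2->WT2, MUT3->WT3
run_b = assign_controls(cases, controls, shift=1)  # MUT1->WT2, MUT2->WT3, MUT3->WT1

for name, mapping in (("run A", run_a), ("run B", run_b)):
    print(f"# {name}")
    print("\n".join(config_lines(mapping, "MUT")))
```

The printed lines can be pasted into two copies of config.yaml; comparing the KAPAC output of both runs then shows how sensitive the results are to the choice of control.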

Hope this helps for now.

Best, Ralf

SamBryce-Smith commented 4 years ago

Just to tag onto this issue, it appears that the sample relationships defined in the config can affect whether samples pass the mTIN > 70 filter in part_one.Snakefile. In the case below, only pairs of samples that both have mTIN > 70 are considered valid, despite many in my HOM condition having > 70 mTIN.

As I've defined the sample relationships here, only the HOM-3 : WT-3 pairing passes the filter.

bias.TIN.median_per_sample.tsv

sample          median_TIN
IP-WT-D14-1     60.078931
IP-WT-D14-2     63.014136
IP-WT-D14-3     72.905163
IP-WT-D14-4     70.372223
IP-HOM-D14-1    71.532313
IP-HOM-D14-2    70.307760
IP-HOM-D14-3    74.176115
IP-HOM-D14-4    68.654441
IP-HOM-D14-5    70.127562
IP-HOM-D14-6    70.768449

(config.yaml)

IP-WT-D14-1: {bam: IP-WT-D14-1_unique_rg_fixed, type: IP_D14_CNTRL}
IP-WT-D14-2: {bam: IP-WT-D14-2_unique_rg_fixed, type: IP_D14_CNTRL}
IP-WT-D14-3: {bam: IP-WT-D14-3_unique_rg_fixed, type: IP_D14_CNTRL}
IP-WT-D14-4: {bam: IP-WT-D14-4_unique_rg_fixed, type: IP_D14_CNTRL}
IP-HOM-D14-1: {bam: IP-HOM-D14-1_unique_rg_fixed, type: IP_D14_HOM, control: IP-WT-D14-1}
IP-HOM-D14-2: {bam: IP-HOM-D14-2_unique_rg_fixed, type: IP_D14_HOM, control: IP-WT-D14-2}
IP-HOM-D14-3: {bam: IP-HOM-D14-3_unique_rg_fixed, type: IP_D14_HOM, control: IP-WT-D14-3}
IP-HOM-D14-4: {bam: IP-HOM-D14-4_unique_rg_fixed, type: IP_D14_HOM, control: IP-WT-D14-4}
IP-HOM-D14-5: {bam: IP-HOM-D14-5_unique_rg_fixed, type: IP_D14_HOM, control: IP-WT-D14-1}
IP-HOM-D14-6: {bam: IP-HOM-D14-6_unique_rg_fixed, type: IP_D14_HOM, control: IP-WT-D14-2}

At this stage, our main interest in this data-set is the inference of poly(A) site usage. Following on from what you've said, would you say it's acceptable to change the control samples for my HOM set so they point to the WT-3 & WT-4 samples (the WT samples are biological replicates)?
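To make the pairing effect concrete, here is a minimal sketch that applies a "both members of a pair must have median TIN > 70" rule to the values listed above. That rule is an assumption based on the behaviour described in this comment, not code taken from part_one.Snakefile, and `passing_pairs` is a hypothetical helper name. The alternating WT-3/WT-4 assignment is just one possible way of re-pointing the controls.

```python
MIN_TIN = 70.0  # threshold quoted in this thread

# Values copied from bias.TIN.median_per_sample.tsv above
median_tin = {
    "IP-WT-D14-1": 60.078931,
    "IP-WT-D14-2": 63.014136,
    "IP-WT-D14-3": 72.905163,
    "IP-WT-D14-4": 70.372223,
    "IP-HOM-D14-1": 71.532313,
    "IP-HOM-D14-2": 70.307760,
    "IP-HOM-D14-3": 74.176115,
    "IP-HOM-D14-4": 68.654441,
    "IP-HOM-D14-5": 70.127562,
    "IP-HOM-D14-6": 70.768449,
}

# Case -> control relationships as posted in config.yaml
posted_controls = {
    "IP-HOM-D14-1": "IP-WT-D14-1",
    "IP-HOM-D14-2": "IP-WT-D14-2",
    "IP-HOM-D14-3": "IP-WT-D14-3",
    "IP-HOM-D14-4": "IP-WT-D14-4",
    "IP-HOM-D14-5": "IP-WT-D14-1",
    "IP-HOM-D14-6": "IP-WT-D14-2",
}


def passing_pairs(controls, tin, threshold=MIN_TIN):
    """Return (case, control) pairs in which both samples exceed the threshold."""
    return [
        (case, control)
        for case, control in controls.items()
        if tin[case] > threshold and tin[control] > threshold
    ]


# Alternative: point every HOM sample at WT-3 or WT-4, alternating between them
good_wt = ["IP-WT-D14-3", "IP-WT-D14-4"]
repointed = {
    case: good_wt[i % len(good_wt)] for i, case in enumerate(posted_controls)
}

print(passing_pairs(posted_controls, median_tin))
print(passing_pairs(repointed, median_tin))
```

Under this assumed rule, the posted relationships keep only the HOM-3 : WT-3 pair, whereas the re-pointed relationships keep five pairs; IP-HOM-D14-4 (median TIN 68.65) falls below the threshold either way.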

Thanks, Sam

koljaLanger commented 4 years ago

Hi Sam, yes, that is what I would suggest doing in this case.

Best Ralf