williamritchie / IRFinder

Detecting intron retention from RNA-Seq experiments
52 stars 25 forks source link

error in quantification #57

Open rcyyjiang opened 5 years ago

rcyyjiang commented 5 years ago

Hi. I got the following errors in the quantification step.

IRFinder-1.2.5/bin/IRFinder: line 562: 49201 Broken pipe gzip -cd "$1" 49202 Aborted (core dumped) | "$LIBEXEC/irfinder" "$OUTPUTDIR" "$REF/IRFinder/ref-cover.bed" "$REF/IRFinder/ref-sj.ref" "$REF/IRFinder/ref-read-continues.ref" "$REF/IRFinder/ref-ROI.bed" "$OUTPUTDIR/unsorted.frag.bam" >> "$OUTPUTDIR/irfinder.stdout" 2>> "$OUTPUTDIR/irfinder.stderr" ERROR: IRFinder appears not to have completed. It appears an unknown component crashed. ERROR: IRFinder appears not to have completed. It appears an unknown component crashed. ERROR: IRFinder appears not to have completed. It appears an unknown component crashed.

Actually, I also had a problem in running generateReadsError.pl. It took a lot of time, more than one day. So I killed the generateReadsError.pl. I don't know whether this caused the quantification errors.

Below are more details which might facilitate troubleshooting.

  1. an example of GTF

    chr10 hg19_refGene start_codon 23481460 23481462 0.000000 + . gene_id "PTF1A"; transcript_id "NM_178161"; gene_biotype "protein-coding"; transcript_biotype "protein-coding"; chr10 hg19_refGene CDS 23481460 23482243 0.000000 + 0 gene_id "PTF1A"; transcript_id "NM_178161"; gene_biotype "protein-coding"; transcript_biotype "protein-coding"; chr10 hg19_refGene exon 23481460 23482243 0.000000 + . gene_id "PTF1A"; transcript_id "NM_178161"; gene_biotype "protein-coding"; transcript_biotype "protein-coding"; chr10 hg19_refGene CDS 23482633 23482832 0.000000 + 2 gene_id "PTF1A"; transcript_id "NM_178161"; gene_biotype "protein-coding"; transcript_biotype "protein-coding"; chr10 hg19_refGene stop_codon 23482833 23482835 0.000000 + . gene_id "PTF1A"; transcript_id "NM_178161"; gene_biotype "protein-coding"; transcript_biotype "protein-coding"; chr10 hg19_refGene exon 23482633 23483181 0.000000 + . gene_id "PTF1A"; transcript_id "NM_178161"; gene_biotype "protein-coding"; transcript_biotype "protein-coding";

By the way, the GTF file and the FA file were named as the manual requested.

  1. Dependencies gcc --version: 5.4.0 STAR --version: 2.5.2b bedtools --version: v2.20.1

  2. Reference building command line IRFinder -m BuildRefProcess -r irfinderdb -b irfinderdb/exclude.bed.gz No error messages came out in this step except that I killed running generateReadsError.pl.

  3. Quantification command line IRFinder -m BAM -r irfinderdb Unsorted.bam

Thanks very much! Any help will be greatly appreciated.

Best, Mei

dg520 commented 5 years ago

Hi Mei,

You shouldn't kill generateReadsError.pl which is a core step during reference building (the mapability estimation), although I'm not sure how you achieve that without killing IRFinder itself. And I don't know why you expect downstream things still work without completely running the beginning part.

What system setup you're working with? How many memory you allocated for IRFinder and how many cores were you using? Human transcriptome GRCh37.75 requires at least 48G memory and 4 cpu threads for both STAR and IRFinder. In terms of that, a typical reference building time is around two hours. If you're working on your own PC, you must know this information. If you're working on a cluster, make sure you allocate sufficient computational resources before running IRFinder.

If you think you have enough cores and memory and if you really think it's generateReadsError.pl that fails your building, you can look into that as well. You might want to independently run that particular part that calls generateReadsError.pl, to investigate why it takes so long or if there is any internal error message from Perl. Here is a shortcut for you (copied from bin/util/Mapability script):

time "$STAREXEC" \
--genomeDir "$STARGENOME" \
--genomeLoad NoSharedMemory \
--runThreadN $THREADS --outStd SAM --outSAMmode NoQS \
--outSAMattributes None \
--outFilterMultimapNmax 1 \
--readFilesIn <("$LIBEXEC/generateReadsError.pl" 70 10 < "$FA") \
| \
awk -v tmpdir="$TMPCHR" -v tmpcmp="$TMPCMP" -v tmpext="$TMPEXT"  'BEGIN {FS="[\t!]"; OFS="\t"} (($8 == "70M") && ($3 == $6) && ($2 == $5)) {print $5, $6-1, $6+69 | ( tmpcmp " -c1 > " tmpdir "/" $5 ".bed." tmpext ) }'

In the above line:

  1. replace $FA with the path to your genome.fa file
  2. replace $STAREXEC with your path of STAR execution
  3. replace $STARGENOME with your path of STAR genome, which IRFinder just built
  4. You can easily replace other variables in the above line (they are all defined in the beginning of bin/util/Mapability. Just read.),

Call it from bash directly. Does it give you any error? If it just needs time to complete, the problem might be due to your system instead of being due to IRFinder.

Nevertheless. let's work around your generateReadsError.pl issue first so that we can make sure the reference can be successfully built in the first place. That might directly solve your problem.

Best, Dadi