williamritchie / IRFinder

Detecting intron retention from RNA-Seq experiments
52 stars 25 forks source link

IRFinder stops: broken pipe #24

Closed ChrisAi89 closed 6 years ago

ChrisAi89 commented 7 years ago

Dear William Ritchie,

I encounter a weird error message. Whenever I run IRFinder in BAM mode (BAM is not sorted), I receive the error message:
..../IRFinder: line 562: 18627 Broken pipe gzip -cd "$1" 18628 Aborted (core dumped) | "$LIBEXEC/irfinder" "$OUTPUTDIR" "$REF/IRFinder/ref-cover.bed" "$REF/IRFinder/ref-sj.ref" "$REF/IRFinder/ref-read-continues.ref" "$REF/IRFinder/ref-ROI.bed" "$OUTPUTDIR/unsorted.frag.bam" >> "$OUTPUTDIR/irfinder.stdout" 2>> "$OUTPUTDIR/irfinder.stderr"

My command to run IRFinder is: ../IRFinder -d .../my_output_dir -m BAM -r .../ref_genom/irfinder_reference/ test.bam

What do I do wrong? I hope my information is sufficient to clarify my problem.

Best wishes,

Chris

dg520 commented 7 years ago

Hi @ChrisAi89,

Could you please check the IRFinder reference folder and make sure all the files are NOT empty inside the subfolder IRFinder? Please also let me know which GCC version you are using. We can start debug process from there.

Best, Dadi

ChrisAi89 commented 7 years ago

Dear Dadi

my GCC has the version 'gcc (SUSE Linux) 4.8.5'. I ran 'BuildRefProcess' now several times with the command ../IRFinder -m BuildRefProcess -r ../ref_genom/irfinder_reference -e ../ref_genom/irfinder_reference/extra-input-files/RNA.SpikeIn.ERCC.fasta.gz -b ../ref_genom/irfinder_reference/extra-input-files/Human_hg19_wgEncodeDacMapabilityConsensusExcludable.bed.gz -R ../ref_genom/irfinder_reference/extra-input-files/Human_hg19_nonPolyA_ROI.bed

Some of my output is, as you mentioned before, empty: 29M Aug 30 12:41 exclude.directional.bed 27M Aug 30 12:41 exclude.omnidirectional.bed 349K Aug 30 12:41 intergenic.ROI.bed 0 Aug 30 12:41 introns.unique.bed 0 Aug 30 12:41 ref-cover.bed 0 Aug 30 12:41 ref-read-continues.ref 354K Aug 30 12:41 ref-ROI.bed 0 Aug 30 12:41 ref-sj.ref

Best Chris

dg520 commented 7 years ago

Hi @ChrisAi89,

Reference building didn't complete successfully. Please mae sure the following: 1) the gzipped bed file insideMapability sub-folder under IRFinder reference folder should not be empty. If it's empty, make sure STAR is callable and the version is later than 2.4.0h. Also make sure every file under IRFinder bin and bin/utils folder are executable. 2) the FASTA and GTF files you downloaded are compatible (e.g. chromosomes in both file should either in a "1","2","3" form or "chr1","chr2","chr2" form).

If the above is fine, please re-run and show me the entire standard output and standard errors in the reference preparation stage. Try to re-run this with a latest version of bedtools and under GCC 4.9.0 or higher if possible.

IRFinder was built under GCC 4.9.0, which indeed uses C++11 features that might not be supported by some 4.8.x. C++ libraries on 4.8.x vary between systems. But we can get some indications from the standard output during reference preparing.

Best, Dadi

ChrisAi89 commented 7 years ago

Dear Dadi,

I re-ran the process, unfortunately the same output!

Command: IRFinder -m BuildRefProcess -r ./irfinder_reference/ -e ./irfinder_reference/extra-input-files/RNA.SpikeIn.ERCC.fasta.gz -b ./irfinder_reference/extra-input-files/Human_hg19_wgEncodeDacMapabilityConsensusExcludable.bed.gz -R ./irfinder_reference/extra-input-files/Human_hg19_nonPolyA_ROI.bed

The output of the log-file is as follows: "Launching reference build process. The full build should take at least one hour. Usage : .../irfinder-1.2.2/bin/util/IRFinder-BuildRefFromEnsembl mode threads STAR-executable base_ftp_url_of_ensembl_genome+gtf output_directory(must not exist) additional_genome_reference(eg: ERCC) non_polyA_genes-as-bed region_blacklist-as-bed Usage example: ../opt/irfinder-1.2.2/bin/util/IRFinder-BuildRefFromEnsembl BuildRef 12 STAR "ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/" "IRFinder/REF/Human" "Refernce-ERCC.fa.gz" [non_polyA_genes.bed] [blacklist.bed] Aug 31 11:33:40 ..... started STAR run Aug 31 11:33:40 ... starting to generate Genome files Aug 31 11:35:36 ... starting to sort Suffix Array. This may take a long time... Aug 31 11:36:17 ... sorting Suffix Array chunks and saving them to disk... Aug 31 12:07:57 ... loading chunks from disk, packing SA... Aug 31 12:13:34 ... finished generating suffix array Aug 31 12:13:34 ... generating Suffix Array index Aug 31 12:20:56 ... completed Suffix Array index Aug 31 12:20:56 ..... processing annotations GTF Aug 31 12:21:28 ..... inserting junctions into the genome indices Aug 31 12:30:08 ... writing Genome to disk ... Aug 31 12:30:23 ... writing Suffix Array to disk ... Aug 31 12:31:35 ... writing SAindex to disk Aug 31 12:31:47 ..... finished successfully Star genome build result: 0 Commence STAR mapping run for mapability. Thu Aug 31 12:31:49 CEST 2017 Completed STAR run. Thu Aug 31 13:32:51 CEST 2017 Commence Coverage calculation. Completed coverage exclusion calculation. Thu Aug 31 13:39:45 CEST 2017 Mapability result: 0 Build Ref 1 Build Ref 2 Build Ref 3 Build Ref 4 Build Ref 5 Build Ref 6 Build Ref 7 Build Ref 8 Build Ref 9 Build Ref 10 Build Ref 11 Build Ref 12 Build Ref 13b Build Ref 14b Build Ref 15b Build Ref 16 - COMPLETE Ref build result: 0 ALL DONE"

My genome and and gtf-file looks like: a) Genome

1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

b) gtf-file

description: evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74)

provider: GENCODE

contact: gencode@sanger.ac.uk

format: gtf

date: 2013-12-05

1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2"; 1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1"; 1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.4";

Seems fine to me. Maybe it is gcc, which I cannot change on our cluster system.

Best Chris

dg520 commented 7 years ago

Hi @ChrisAi89,

Did you still get empty reference file inside IRFinder sub-folder? Seems IRFinder didn't give any error at reference preparation stage but still cannot successfully generate all references. At least it's not due to a GCC problem at this stage. Would you mind share me you GTF and FASTA file so that I can have a detailed look on my end?

Best, Dadi

ChrisAi89 commented 7 years ago

Hi @dg520 ,

do you have a mail address, where I can send the links for my files to?

Best Chris

dg520 commented 7 years ago

Hi @ChrisAi89,

My email is dgao2 et mgh dot harvard dot edu

omansn commented 7 years ago

Hi @ChrisAi89,

I was having the exact same issue as you. Like @dg520 suggested, I resolved the issue by recompiling bedtools with GCC 6.3.0 (before I was running bedtools 2.26.0 compiled with GCC 4.4.6).

If your cluster uses Spack or Environmental Modules, you can change your compiler version. Since I didn't have permissions to re-install bedtools on the cluster spack directory, I had to download it in my home directory and compile it there.

Hope that helps! Nathan

dg520 commented 7 years ago

Hi @omansn,

Thanks Nathan. That's an important finding. Really appreciate that. I didn't realize bedtools would fail on an older version of GCC after a successful installation. I just confirmed this by loading GCC 4.8.0. It failed to call bedtools (v2.26) with an error message saying library missing.

@ChrisAi89, could you please check if you can successfully run bedtools merge, bedtools slop and bedtools supplement on a test file? Not sure yet if that matches your case, as there seems no bedtools error message in your report.

Best, Dadi

ChrisAi89 commented 7 years ago

Hi guys,

thanks for the suggestions, I will give it a try. @dg520 Did you already have time to look at my GTF and fasta file?

Best, Chris

dg520 commented 7 years ago

Hi @ChrisAi89,

I didn't receive anything from you. You can copy my email address from the bottom of the wiki frontpage. Or you can put the link here if that's ok for you.

Best, Dadi

dg520 commented 7 years ago

Hi @ChrisAi89,

Have you sent me the link?

Best, Dadi

dg520 commented 7 years ago

Hi @ChrisAi89,

It turned out that your GTF file doesn't have the exact attribution tags as those of Ensembl annotation at the last column, so that IRFinder cannot get essential information to process reference.

Particularly, IRFinder needs attribution tags transcript_biotype and gene_biotype (according to Ensembl) to find protein coding genes and processed transcripts before building reference. However in your GTF, the two corresponding tags are transcript_type and gene_type.

Solution: You have to replace gene_type with gene_biotype and replace transcript_type with transcript_biotype in your GTF and re-run IRFinder to prepare reference. Assuming your GCC doesn't cause other problems, this should work for your analysis. But at least I believe your can complete reference building successfully now, where no empty file should be generated in the reference folder.

Let me know how it goes.

UPDATE: I just released a new version that works for cases like yours. You can download it the re-run the reference preparation step, if you are reluctant to change your GTF.

Best, Dadi

ChrisAi89 commented 7 years ago

Dear @dg520 , first of all, thanks for your help and the update of IRFinder! I re-run the reference building process again with the new version and now, it seems to have worked perfectly fine. drwxr-xr-x 2 ... 286 Sep 14 17:04 . drwxr-xr-x 7 ... 201 Sep 14 17:01 .. -rw-r--r-- 1 ... 26M Sep 14 17:02 exclude.directional.bed -rw-r--r-- 1 ... 27M Sep 14 17:02 exclude.omnidirectional.bed -rw-r--r-- 1 ... 349K Sep 14 17:04 intergenic.ROI.bed -rw-r--r-- 1 ... 14M Sep 14 17:02 introns.unique.bed -rw-r--r-- 1 ... 95M Sep 14 17:04 ref-cover.bed -rw-r--r-- 1 ... 5.5M Sep 14 17:04 ref-read-continues.ref -rw-r--r-- 1 ... 403K Sep 14 17:04 ref-ROI.bed -rw-r--r-- 1 ... 6.1M Sep 14 17:04 ref-sj.ref

Best, Chris

dg520 commented 7 years ago

Hi @ChrisAi89,

Glad to know problem solved.

Cheers, Dadi