reference index - Githubissues

YongchaoDou commented 7 years ago

Hi William,

I tried to build a reference by BuildRefProcess mode. However, it always shows such error:

Error: The requested bed file (introns.unique.bed) could not be opened. Exiting! Error: The requested bed file (introns.unique.bed) could not be opened. Exiting! Error: The requested bed file (/dev/fd/63) could not be opened. Exiting! sort: write failed: standard output: Broken pipe sort: write error sort: write failed: standard output: Broken pipe sort: write error

Can you help me to figure out the problem? Following are more detail information.

Thank!

Yongchao Dou

#################################################

my command line

################################################# bin/IRFinder -m BuildRefProcess -r irfinderIndex -e REF/extra-input-files/RNA.SpikeIn.ERCC.fasta.gz -b REF/extra-input-files/Human_hg19_wgEncodeDacMapabilityConsensusExcludable.bed.gz -R REF/extra-input-files/Human_hg19_nonPolyA_ROI.bed #################################################

example of my annotation

################################################# 12 hg19_refGene exon 9220304 9220435 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9220779 9220820 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9221336 9221438 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9222341 9222409 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9223084 9223174 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9224955 9225082 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9225249 9225467 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9227156 9227379 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9229352 9229532 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9229942 9230016 . - . gene_id "A2M"; transcript_id "NM_000014"; #################################################

example of my genome

#################################################

10 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

dg520 commented 7 years ago

Hi Yongchao,

This is wired. Not sure why your system cannot write things. Please check the file irfinderIndex/IRFinder/tmp.candidate.introns. If it's there and is not empty, can you manually run the following:

cd irfinderIndex/IRFinder
sort -t $'\t' -s -S 500M -k1,1 -k2,2n -k3,3n -k6,6 -k4,4 < tmp.candidate.introns | sort -t $'\t' -s -S 500M -k1,1 -k2,2n -k3,3n -k6,6 -u  > introns.unique.bed

Does it give you the same error?

Best, Dadi

YongchaoDou commented 7 years ago

It's there. But it's empty. May you help me to check the annotation format? #################################################

example of my annotation

################################################# 12 hg19_refGene exon 9220304 9220435 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9220779 9220820 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9221336 9221438 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9222341 9222409 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9223084 9223174 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9224955 9225082 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9225249 9225467 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9227156 9227379 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9229352 9229532 . - . gene_id "A2M"; transcript_id "NM_000014"; 12 hg19_refGene exon 9229942 9230016 . - . gene_id "A2M"; transcript_id "NM_000014";

Best,

Yongchao Dou

YongchaoDou commented 7 years ago

I tried the ensembl annotation (Homo_sapiens.GRCh37.75.gtf.gz) by the same command line. It works well. It looks the tool can't analyze my annotation well.

Best,

Yongchao Dou

YongchaoDou commented 7 years ago

When I tried ucsc hg19 genome and ensemblannotation, the introns.unique.bed is no empty. But there are still errors: ################################################### Error: malformed BED entry at line 2612544. End Coordinate detected that is < 0. Exiting. sort: write failed: standard output: Broken pipe sort: write error Error: malformed BED entry at line 3476867. End Coordinate detected that is < 0. Exiting. sort: write failed: standard output: Broken pipe sort: write error Error: malformed BED entry at line 166196. End Coordinate detected that is < 0. Exiting. sort: write failed: standard output: Broken pipe sort: write error. ############################################ I think we should limit both genome and annotation to chr1-24,X,Y,and M. All other contigs should be removed.

Best,

Yongchao Dou

dg520 commented 7 years ago

Hi Yongchao,

I strongly recommend you using both FASTA and GTF from the same resource (e.g. Ensembl FASTA and Ensembl GTF). This is important as different resources of FASTA and GTF can conflict each other. For example, Ensembl record human chromosome as "1, 2, 3..." while that is "chr1, chr2, chr3..." in UCSC. Such difference would lead to the failure of matching gene coordinates onto the genome. Another essential concern to have GTF and FASTA from the same resource is that it will ensure all the genes are inside the range of the chromosome. I think your failure of building reference might be due to some of genes in your annotation are out of range of its chromosome in your FASTA. You might want to check that. And the third concern is what you said, the chromosome in the GTF file should be fully contained in the FASTA. It's usually not the case when you are using GTF from other resources than the FASTA file.

P.S. be aware that there is no 23 and 24 for human chromosome.

Best, Dadi

YongchaoDou commented 7 years ago

Hello Dadi,

Can you help me to index my genome and annotation? You can download these genome and annotation here: ucsc hg19 genome: https://bcm.box.com/s/gr64q7b9x1rslpj8y3yccgup8xrg6pp2 refseq hg19 annotation: https://bcm.box.com/s/cc5wtj17791sq5wcl9rzf1pdlui57ouu

Thank you very much!

Yongchao Dou

dg520 commented 7 years ago

Hi Yongchao,

I took a closer look at your GTF. The problem is in the second column of the file, which only has the key word "hg19_refGene" and is not informative. IRFinder requires that column to indicate outcomes of transcripts (e.g. protein_coding, processed_transcript and etc). Actually it has to be described in the exactly same way as Ensembl symbols in that term, because IRFinder builds intron index based on that information.

My suggestion is to either use an Ensembl GTF (make sure chromosomes are recorded in the same way as the FASTA) or find a way to convert the second column of your current GTF to key words like Ensembl ones.

Best, Dadi

dg520 commented 7 years ago

Hi Yongchao,

I have to correct myself. The “process_transcript” and “protein_coding” information is not extracted from the second column of the GTF. Instead, IRFinder gets that information from the attribute “transcript_biotype” in the last column. This attribute must present in the GTF. Sorry for confusing.

Best, Dadi

YongchaoDou commented 7 years ago

Hi Dadi,

Thanks for your help. I will test it soon.

Best, Yongchao Dou

williamritchie / IRFinder

reference index #13

my command line

example of my annotation

example of my genome

example of my annotation