Problems building the Reference

ClariHC commented 3 years ago

Hi I'm building the reference and I having this problem!

This is my code

~/IRFinder-1.3.0/bin/IRFinder -m BuildRefFromSTARRef -r REF/Human-GRCh38-release100 -x /share/project/m$

Cheers! Thanks

Clara

Launching reference build process. The full build might take hours. <Phase 1: STAR Reference Preparation> Jun 06 10:25:33 ... copying the genome FASTA file... Jun 06 10:25:53 ... copying the transcriptome GTF file... Jun 06 10:25:57 ... copying the STAR reference folder... <Phase 2: Mapability Calculation> Jun 06 10:27:59 ... mapping genome fragments back to genome... Jun 06 15:15:50 ... sorting aligned genome fragments... [bam_sort_core] merging from 72 files and 12 in-memory blocks... Jun 06 15:31:52 ... indexing aligned genome fragments... Jun 06 15:33:23 ... filtering aligned genome fragments by chromosome/scaffold... Jun 06 16:43:27 ... merging filtered genome fragments... Jun 06 16:44:57 ... calculating regions for exclusion... Jun 06 16:55:13 ... cleaning temporary files... <Phase 3: IRFinder Reference Preparation> Jun 06 16:55:22 ... building Ref 1... Jun 06 16:56:38 ... building Ref 2... Jun 06 16:56:44 ... building Ref 3... Jun 06 16:56:44 ... building Ref 4... ***** WARNING: File introns.unique.bed has inconsistent naming convention for record: chr1 12227 12612 DDX11L1/ENSG00000223972.5/+ 0 +

***** WARNING: File introns.unique.bed has inconsistent naming convention for record: chr1 12227 12612 DDX11L1/ENSG00000223972.5/+ 0 +

Jun 06 16:57:46 ... building Ref 5... ***** WARNING: File introns.unique.bed has inconsistent naming convention for record: chr1 12227 12612 DDX11L1/ENSG00000223972.5/+ 0 +

***** WARNING: File introns.unique.bed has inconsistent naming convention for record: chr1 12227 12612 DDX11L1/ENSG00000223972.5/+ 0 +

Jun 06 16:59:02 ... building Ref 6... Jun 06 16:59:04 ... building Ref 7... Jun 06 16:59:09 ... building Ref 8... Jun 06 16:59:11 ... building Ref 9... Jun 06 16:59:11 ... building Ref 10c... Jun 06 16:59:11 ... building Ref 11c...

dg520 commented 3 years ago

@ClariHC It seems the chromosome names in your GTF are not compatible with those in the FASTA. For example, one of them is named by "chr1, chr2, chr3..." while the other is named without "chr" as "1, 2, 3...". If that's the case, please make the naming in the same format and run the IRFinder reference preparation again.

ClariHC commented 3 years ago

Hi! thanks for the quick answer. If I do head of the gtf or the fasta they seem not to have different names. Or could be the second 1 after the chr1 1

head GRCh38.primary_assembly.genome.fa

chr1 1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN (base) ###########

head gencode.v38.primary_assembly.annotation.gtf

description: evidence-based annotation of the human genome (GRCh38), version 38 (Ensembl 104)

provider: GENCODE

contact: gencode-help@ebi.ac.uk

format: gtf

date: 2021-03-12

chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";

dg520 commented 3 years ago

@ClariHC Is the first line in the FASTA read as chr1 1, instead of >chr1 1 ?

ClariHC commented 3 years ago

I miss copping the > is >chr 1 1

dg520 commented 3 years ago

@ClariHC Which version of bedtools are you using?

ClariHC commented 3 years ago

$ bedtools -version bedtools v2.27.1 (base)

dg520 commented 3 years ago

OK. Could you please try the following:

Could you please comment out the third last line in bin/util/Build-BED-refs.sh:
```
rm tmp.50 tmp-dir.IntronCover.bed tmp-nd.IntronCover.bed tmp.read-continues tmp.candidate.introns tmp.exons.exclude tmp.all.annotations tmp.reversed.genes tmp.ROI.rRNA.bed tmp.ROI.combined.bed
```
This will keep intermediate files during reference preparation, so that we can get a better understanding which exact step leading to the problem.
Re-run with bedtools v2.29 or 2.30 as the default bedtools call if possible.

Best, Dadi

ClariHC commented 3 years ago

Hi! I did both and run it again. I still have the same problem

Launching reference build process. The full build might take hours. <Phase 1: STAR Reference Preparation> Jun 07 09:20:38 ... copying the genome FASTA file... Jun 07 09:21:00 ... copying the transcriptome GTF file... Jun 07 09:21:06 ... copying the STAR reference folder... <Phase 2: Mapability Calculation> Jun 07 09:23:22 ... mapping genome fragments back to genome... Jun 07 14:11:31 ... sorting aligned genome fragments... [bam_sort_core] merging from 72 files and 12 in-memory blocks... Jun 07 14:28:19 ... indexing aligned genome fragments... Jun 07 14:29:55 ... filtering aligned genome fragments by chromosome/scaffold... Jun 07 15:39:47 ... merging filtered genome fragments... Jun 07 15:41:14 ... calculating regions for exclusion... Jun 07 15:49:23 ... cleaning temporary files... <Phase 3: IRFinder Reference Preparation> Jun 07 15:50:03 ... building Ref 1... Jun 07 15:51:14 ... building Ref 2... Jun 07 15:51:18 ... building Ref 3... Jun 07 15:51:18 ... building Ref 4... ***** WARNING: File introns.unique.bed has inconsistent naming convention for record: chr1 12227 12612 DDX11L1/ENSG00000223972.5/+ 0 +

***** WARNING: File introns.unique.bed has inconsistent naming convention for record: chr1 12227 12612 DDX11L1/ENSG00000223972.5/+ 0 +

Jun 07 15:52:12 ... building Ref 5... ***** WARNING: File introns.unique.bed has inconsistent naming convention for record: chr1 12227 12612 DDX11L1/ENSG00000223972.5/+ 0 +

***** WARNING: File introns.unique.bed has inconsistent naming convention for record: chr1 12227 12612 DDX11L1/ENSG00000223972.5/+ 0 +

Jun 07 15:53:18 ... building Ref 6... Jun 07 15:53:19 ... building Ref 7... Jun 07 15:53:24 ... building Ref 8... Jun 07 15:53:25 ... building Ref 9... Jun 07 15:53:25 ... building Ref 10c... Jun 07 15:53:25 ... building Ref 11c... IRFinder reference build: Failed!

ClariHC commented 3 years ago

Here are some of the heads of the tmp files! Best, Clara

dg520 commented 3 years ago

@ClariHC Sorry for backing on this late. I need a bit more insight. Could you please run the following lines in the REF/Human-GRCh38-release100/IRFinder directory:

cat \
<( cat exclude.omnidirectional.bed | sort -t $'\t' -S 500M -k1,1 -k2,2n -k3,3n | bedtools merge -i stdin | awk 'BEGIN {FS="\t"; OFS="\t"} {print $1, $2, $3, "X", "0", "+"; print $1, $2, $3, "X", "0", "-" }'>
<( cat exclude.directional.bed | sort -t $'\t' -S 500M -k1,1 -k2,2n -k3,3n -k6,6 -u | awk 'BEGIN {FS="\t"; OFS="\t"} {print $1, $2, $3, "E", "0", $6}' ) \
| \
sort -t $'\t' -S 1G -k1,1 -k2,2n -k3,3n -k6,6 \
> test1.bed

followed by:

bedtools intersect -s -sorted -wao -a introns.unique.bed -b test1.bed > test2.txt

Is this going to reproduce your error? Let me know if so, and please do keep test1.bed and test2.txt as we would need them for further debugging.

ClariHC commented 3 years ago

Hi! I'm really thankful for the help! :) Here is the output

In the first part, I had to add a parenthesis. I hope is in the right place.

clara@gateway ~/clara/apps/IRFinder-1.3.0/REF/Human-GRCh38-release100/IRFinder
$ cat \ <( cat exclude.omnidirectional.bed | sort -t $'\t' -S 500M -k1,1 -k2,2n -k3,3n | bedtools merge -i stdin | awk 'BEGIN {FS="\t"; OFS="\t"} {print $1, $2, $3, "X", "0", "+"; print $1, $2, $3, "X", "0", "-" }')> <( cat exclude.directional.bed | sort -t $'\t' -S 500M -k1,1 -k2,2n -k3,3n -k6,6 -u | awk 'BEGIN {FS="\t"; OFS="\t"} {print $1, $2, $3, "E", "0", $6}' ) \ | \ sort -t $'\t' -S 1G -k1,1 -k2,2n -k3,3n -k6,6 \ > test1.bed
cat: ' /dev/fd/63': No such file or directory
cat: ' ': No such file or directory

clara@gateway ~/clara/apps/IRFinder-1.3.0/REF/Human-GRCh38-release100/IRFinder
$ head test2.txt
GL000194.1      54832   55445   MAFIP/ENSG00000274847.1/-       0       -       .       -1      -1      0
GL000194.1      55676   112791  ENSG00000277400/ENSG00000277400.1/-     0       -       .       -1      -1      0
GL000194.1      112850  114985  ENSG00000277400/ENSG00000277400.1/-     0       -       .       -1      -1      0
GL000195.1      44923   48954   ENSG00000276256/ENSG00000276256.1/-     0       -       .       -1      -1      0
GL000195.1      174130  178159  ENSG00000278198/ENSG00000278198.1/+     0       +       .       -1      -1      0
GL000205.2      100563  104596  ENSG00000273496/ENSG00000273496.1/-     0       -       .       -1      -1      0
GL000213.1      108247  109883  ENSG00000277630/ENSG00000277630.4/-     0       -       .       -1      -1      0
GL000213.1      110007  118421  ENSG00000277630/ENSG00000277630.4/-     0       -       .       -1      -1      0
GL000213.1      118588  119628  ENSG00000277630/ENSG00000277630.4/-     0       -       .       -1      -1      0
GL000213.1      119673  121072  ENSG00000277630/ENSG00000277630.4/-     0       -       .       -1      -1      0
(base)
clara@gateway ~/clara/apps/IRFinder-1.3.0/REF/Human-GRCh38-release100/IRFinder
$ head test1.bed
(base)

dg520 commented 3 years ago

@ClariHC , Ah sorry, the command 1 is a bit broken, although it's not due to a missing parenthesis. Please try the following. To keep things more readable, please try the following commands to regenerate the two files:

cat exclude.omnidirectional.bed | sort -t $'\t' -S 500M -k1,1 -k2,2n -k3,3n | bedtools merge -i stdin | awk 'BEGIN {FS="\t"; OFS="\t"} {print $1, $2, $3, "X", "0", "+"; print $1, $2, $3, "X", "0", "-" }' > imf1.bed

cat exclude.directional.bed | sort -t $'\t' -S 500M -k1,1 -k2,2n -k3,3n -k6,6 -u | awk 'BEGIN {FS="\t"; OFS="\t"} {print $1, $2, $3, "E", "0", $6}' > imf2.bed

cat imf1.bed imf2.bed | sort -t $'\t' -S 1G -k1,1 -k2,2n -k3,3n -k6,6 > test1.bed 

bedtools intersect -s -sorted -wao -a introns.unique.bed -b test1.bed > test2.txt

Let me know if you can successfully generate test1.bed and test2.txt. If not, let me know where it fails and what is the error message. If there is any warning from the last command (bedtools intersect), please also paste it here.

Another thing I just found is that it seems you're not using the latest IRFinder, although that should be irrelevant to your current issue. Please try to git clone the latest version if possible.

Thanks!

ClariHC commented 3 years ago

Hi!

This is the warning I got from the bedtools intesect

$ bedtools intersect -s -sorted -wao -a introns.unique.bed -b test1.bed > test2.txt
***** WARNING: File introns.unique.bed has inconsistent naming convention for record:
chr1    12227   12612   DDX11L1/ENSG00000223972.5/+     0       +

***** WARNING: File introns.unique.bed has inconsistent naming convention for record:
chr1    12227   12612   DDX11L1/ENSG00000223972.5/+     0       +

dg520 commented 3 years ago

@ClariHC If you do the following:

bedtools intersect -s -sorted -wao -a <(grep "^chr" introns.unique.bed) -b <(grep "^chr" test1.bed) > test2.txt

Is it going to work without error?

ClariHC commented 3 years ago

It looks like I have no error when I run it.

bedtools intersect -s -sorted -wao -a <(grep "^chr" introns.unique.bed) -b <(grep "^chr" test1.bed) > test2.txt

dg520 commented 3 years ago

@ClariHC Great. I guess we figured out the problem then: your sorting is configured to something uncommon. In a more common situation, Linux will sort using alphabet in UTF-8 order. In that case, "chr1" should be sorted before "GL000194.1". Although most tools don't care which chromosome comes first if the coordinates within each chromosome have been sorted, bedtools indeed cares. It expects main chromosomes come before scaffolds for reasons (I'll not expand here), otherwise, it shows the WARNING you've seen.

To avoid breaking other tools on your system, I don't suggest you to change the language encoding for sorting. Instead, I would suggest you to get rid of non-main chromosomes (i.e. only keep chromsome 1-22, X, Y and chrM) from both FASTA and GTF files and re-build the IRFinder reference, if your research of interest is not on those chromosomes.

And before that, I'd strongly recommend you use the latest IRFinder if you haven't done so yet.

Let me know.

ClariHC commented 3 years ago

Dear @dg520 it looks like it worked. Thanks a lot for the help!

Cheers

dg520 commented 3 years ago

@ClariHC Glad it works. I'll close this thread for now.

williamritchie / IRFinder