williamritchie / IRFinder

Detecting intron retention from RNA-Seq experiments

Running BuildRef #181

Closed ian-bda closed 8 months ago

ian-bda commented 8 months ago

Hi, I am trying to run the following command:

#!/bin/bash

/home5/ibirchl/IRFinder-2.0-beta/bin/IRFinder BuildRef -r REF/Zebrafish-GRCz11-release111 \
   ftp://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/dna/Danio_rerio.GRCz11.111.gtf.gz

but it keeps giving me the error:

Launching reference build process. The full build might take hours.
Trying to fetch dna.primary_assembly and GTF based on:
ftp://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/dna/Danio_rerio.GRCz11.111.gtf.gz

Warning: wildcards not supported in HTTP.
--2024-03-08 12:33:59--  ftp://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/dna/*.dna.primary_assembly.fa.gz
Connecting to 192.168.1.20:3128... connected.
Proxy request sent, awaiting response... 404 Not Found
2024-03-08 12:34:02 ERROR 404: Not Found.

Warning: wildcards not supported in HTTP.
--2024-03-08 12:34:02--  ftp://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/dna/*.dna.toplevel.fa.gz
Connecting to 192.168.1.20:3128... connected.
Proxy request sent, awaiting response... 404 Not Found
2024-03-08 12:34:04 ERROR 404: Not Found.

Failed to download fa.gz file.

It's probably just a simple formatting issue I'm missing, but any help is greatly appreciated! Thanks.

dg520 commented 8 months ago

@ian-bda

  1. Officially, there is no IRFinder 2.0, and in the official version the call is IRFinder -m BuildRef. Your command omits -m, so I have to assume you are using an adapted version. Please note: a) I can only provide advice based on the official version, as I don't know what else has been changed in a customized version, and b) I am more than happy to look into anything in the official version that does not work as expected. While I may provide suggestions for customized versions case by case, I can't guarantee they will always work, and I won't dig into why they don't.
  2. The official IRFinder -m BuildRef expects a valid FTP address of an existing GTF file on the Ensembl server. Your input, ftp://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/dna/Danio_rerio.GRCz11.111.gtf.gz does NOT exist. The existing one is: ftp://ftp.ensembl.org/pub/release-111/gtf/danio_rerio/Danio_rerio.GRCz11.111.gtf.gz. I think you mixed up the gtf and fasta folders on the FTP site.
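The relationship between the two Ensembl folders can be sketched in shell. This is only an illustration of the path layout visible in the logs in this thread, not IRFinder's actual implementation, and the function name `gtf_to_fasta_dir` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of IRFinder): map an Ensembl GTF URL to the
# matching genome FASTA directory, following the layout seen in this thread:
#   .../release-N/gtf/<species>/<file>.gtf.gz  ->  .../release-N/fasta/<species>/dna/
gtf_to_fasta_dir() {
    local gtf_url="$1"
    # Reject URLs that do not sit under a /gtf/<species>/ folder.
    case "$gtf_url" in
        */gtf/*/*.gtf.gz) ;;
        *) echo "not under a gtf/<species>/ folder: $gtf_url" >&2; return 1 ;;
    esac
    local species_dir="${gtf_url%/*}"           # strip the file name
    local species="${species_dir##*/}"          # e.g. danio_rerio
    local release_root="${species_dir%/gtf/*}"  # e.g. ftp://.../pub/release-111
    echo "${release_root}/fasta/${species}/dna/"
}

gtf_to_fasta_dir "ftp://ftp.ensembl.org/pub/release-111/gtf/danio_rerio/Danio_rerio.GRCz11.111.gtf.gz"
```

With the corrected URL this prints the fasta/danio_rerio/dna/ directory; with the original URL (a .gtf.gz name under the fasta folder) the helper returns an error instead.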

ian-bda commented 8 months ago

Hi @dg520

Thanks for your quick response. No idea how I ended up with a custom version of IRFinder. Just re-downloaded it to get the correct version and changed the URL to the correct one. Here is my new script:

#!/bin/bash

/home5/ibirchl/IRFinder-1.3.0/bin/IRFinder -m BuildRef -r REF/Zebrafish-GRCz11-release111 \
ftp://ftp.ensembl.org/pub/release-111/gtf/danio_rerio/Danio_rerio.GRCz11.111.gtf.gz

Unfortunately I am still getting the following error:

Launching reference build process. The full build might take hours.
Trying to fetch dna.primary_assembly and GTF based on:
ftp://ftp.ensembl.org/pub/release-111/gtf/danio_rerio/Danio_rerio.GRCz11.111.gtf.gz

Warning: wildcards not supported in HTTP.
--2024-03-08 14:00:32--  ftp://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/dna/*.dna.primary_assembly.fa.gz
Connecting to 192.168.1.20:3128... connected.
Proxy request sent, awaiting response... 404 Not Found
2024-03-08 14:00:36 ERROR 404: Not Found.

Warning: wildcards not supported in HTTP.
--2024-03-08 14:00:36--  ftp://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/dna/*.dna.toplevel.fa.gz
Connecting to 192.168.1.20:3128... connected.
Proxy request sent, awaiting response... 404 Not Found
2024-03-08 14:00:38 ERROR 404: Not Found.

Failed to download fa.gz file.

dg520 commented 8 months ago

@ian-bda It works on my end. See the command and messages below:

(base) TESTMACHINE:~$ IRFinder -m BuildRef -r test_ref ftp://ftp.ensembl.org/pub/release-111/gtf/danio_rerio/Danio_rerio.GRCz11.111.gtf.gz
Launching reference build process. The full build might take hours.
Trying to fetch dna.primary_assembly and GTF based on:
ftp://ftp.ensembl.org/pub/release-111/gtf/danio_rerio/Danio_rerio.GRCz11.111.gtf.gz

--2024-03-08 13:27:12--  ftp://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/dna/*.dna.primary_assembly.fa.gz
           => '.listing'
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-111/fasta/danio_rerio/dna ... done.
==> PASV ... done.    ==> LIST ... done.

.listing                          [ <=>                                              ]   9.16K  --.-KB/s    in 0.009s

2024-03-08 13:27:13 (1014 KB/s) - '.listing' saved [9379]

Removed '.listing'.
--2024-03-08 13:27:13--  ftp://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
           => 'Danio_rerio.GRCz11.dna.primary_assembly.fa.gz'
==> CWD not required.
==> PASV ... done.    ==> RETR Danio_rerio.GRCz11.dna.primary_assembly.fa.gz ... done.
Length: 410230731 (391M)

Danio_rerio.GRCz11.dna.primar 100%[=================================================>] 391.23M  13.0MB/s    in 32s

2024-03-08 13:27:45 (12.2 MB/s) - 'Danio_rerio.GRCz11.dna.primary_assembly.fa.gz' saved [410230731]

--2024-03-08 13:27:45--  ftp://ftp.ensembl.org/pub/release-111/gtf/danio_rerio/Danio_rerio.GRCz11.111.gtf.gz
           => 'Danio_rerio.GRCz11.111.gtf.gz'
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-111/gtf/danio_rerio ... done.
==> SIZE Danio_rerio.GRCz11.111.gtf.gz ... 18347398
==> PASV ... done.    ==> RETR Danio_rerio.GRCz11.111.gtf.gz ... done.
Length: 18347398 (17M) (unauthoritative)

Danio_rerio.GRCz11.111.gtf.gz 100%[=================================================>]  17.50M  10.5MB/s    in 1.7s

2024-03-08 13:27:48 (10.5 MB/s) - 'Danio_rerio.GRCz11.111.gtf.gz' saved [18347398]

<Phase 1: STAR Reference Preparation>
Mar 08 13:27:59 ..... started STAR run
Mar 08 13:27:59 ... starting to generate Genome files

One possible issue is that your machine does not fully support FTP or HTTP. To rule this out, could you please run the following wget command and see whether you can download the GTF file successfully?

wget ftp://ftp.ensembl.org/pub/release-111/gtf/danio_rerio/Danio_rerio.GRCz11.111.gtf.gz
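If the plain wget succeeds but IRFinder still fails, it may also be worth checking whether a proxy is configured in the environment IRFinder runs in. The failing log shows wget connecting to 192.168.1.20:3128 and sending a "Proxy request", which suggests the request was routed through an HTTP proxy. The sketch below only lists the proxy variables that wget consults; `show_proxy_env` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the proxy-related environment variables that
# wget consults, so you can see whether FTP requests may be sent via a proxy.
show_proxy_env() {
    local name value
    for name in http_proxy https_proxy ftp_proxy no_proxy; do
        # Indirect lookup of the variable named in $name (empty if unset).
        eval "value=\${$name:-}"
        echo "${name}=${value:-<unset>}"
    done
}

show_proxy_env
```

If ftp_proxy turns out to be set, repeating the test download with wget --no-proxy is one way to compare the two paths. Whether a direct connection is allowed depends on your network.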

Let me know.

ian-bda commented 8 months ago

@dg520 I ran wget ftp://ftp.ensembl.org/pub/release-111/gtf/danio_rerio/Danio_rerio.GRCz11.111.gtf.gz and it worked:

--2024-03-08 14:45:48--  ftp://ftp.ensembl.org/pub/release-111/gtf/danio_rerio/Danio_rerio.GRCz11.111.gtf.gz
           => 'Danio_rerio.GRCz11.111.gtf.gz.'
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-111/gtf/danio_rerio ... done.
==> SIZE Danio_rerio.GRCz11.111.gtf.gz ... 18347398
==> PASV ... done.    ==> RETR Danio_rerio.GRCz11.111.gtf.gz ... done.
Length: 18347398 (17M) (unauthoritative)

Danio_rerio.GRCz11.111.gtf. 100%[==========================================>]  17.50M   112KB/s    in 3m 59s  

2024-03-08 14:49:49 (74.9 KB/s) - 'Danio_rerio.GRCz11.111.gtf.gz.' saved [18347398]

Also tried rerunning the command exactly as you wrote it and am still getting the same error.

#!/bin/bash

/home5/ibirchl/IRFinder-1.3.0/bin/IRFinder -m BuildRef -r test_ref ftp://ftp.ensembl.org/pub/release-111/gtf/danio_rerio/Danio_rerio.GRCz11.111.gtf.gz
Launching reference build process. The full build might take hours.
Trying to fetch dna.primary_assembly and GTF based on:
ftp://ftp.ensembl.org/pub/release-111/gtf/danio_rerio/Danio_rerio.GRCz11.111.gtf.gz

Warning: wildcards not supported in HTTP.
--2024-03-08 14:53:08--  ftp://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/dna/*.dna.primary_assembly.fa.gz
Connecting to 192.168.1.20:3128... connected.
Proxy request sent, awaiting response... 404 Not Found
2024-03-08 14:53:10 ERROR 404: Not Found.

Warning: wildcards not supported in HTTP.
--2024-03-08 14:53:10--  ftp://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/dna/*.dna.toplevel.fa.gz
Connecting to 192.168.1.20:3128... connected.
Proxy request sent, awaiting response... 404 Not Found
2024-03-08 14:53:12 ERROR 404: Not Found.

Failed to download fa.gz file.

dg520 commented 8 months ago

@ian-bda This tells us that FTP is supported, which is good, but wildcards in the address are not (e.g., ftp://test/*.fa). To get wildcards supported, you will have to work with the IT admins who configure the machine you're working on.
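The two signs of this situation in the failing logs are the line "Warning: wildcards not supported in HTTP." and the "Proxy request sent" line, whereas the working run logs in to the FTP server directly and fetches a '.listing'. A rough way to check a saved wget log for those markers is sketched below. The function name `diagnose_wget_log` and its three labels are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read a wget log on stdin and report whether the
# request appears to have gone over HTTP (typically via a proxy), where wget
# cannot expand wildcards such as *.dna.primary_assembly.fa.gz.
diagnose_wget_log() {
    local log
    log="$(cat)"
    if printf '%s\n' "$log" | grep -q 'wildcards not supported in HTTP'; then
        echo "wildcard-over-http"
    elif printf '%s\n' "$log" | grep -q 'Logging in as anonymous'; then
        echo "direct-ftp"
    else
        echo "unknown"
    fi
}

# Example with two lines copied from the failing log in this thread.
diagnose_wget_log <<'EOF'
Warning: wildcards not supported in HTTP.
Proxy request sent, awaiting response... 404 Not Found
EOF
```

The example prints wildcard-over-http. Feeding it the log of the working run would print direct-ftp instead.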

Meanwhile, here is a workaround; see if you want to adapt it for your workflow. Basically, you need to run the following:

mkdir -p REF/Zebrafish-GRCz11-release111  # This is the IRFinder reference folder you will use. Feel free to change it to another location. -p also creates REF if it does not exist yet
cd  REF/Zebrafish-GRCz11-release111
wget ftp://ftp.ensembl.org/pub/release-111/gtf/danio_rerio/Danio_rerio.GRCz11.111.gtf.gz
wget ftp://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
gunzip Danio_rerio.GRCz11.111.gtf.gz
gunzip Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
mv Danio_rerio.GRCz11.111.gtf transcripts.gtf
mv Danio_rerio.GRCz11.dna.primary_assembly.fa genome.fa
cd ../../
/home5/ibirchl/IRFinder-1.3.0/bin/IRFinder -m BuildRefProcess -r REF/Zebrafish-GRCz11-release111

Once the building process is completed, you can remove transcripts.gtf and genome.fa under REF/Zebrafish-GRCz11-release111 to save disk space.
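Before launching BuildRefProcess, a quick check that both renamed files are in place can save a failed run. This is a minimal sketch based only on the file names used in the workaround above (genome.fa and transcripts.gtf); `check_ref_inputs` is a hypothetical helper, not an IRFinder command.

```shell
#!/bin/bash
# Hypothetical helper: confirm that a reference folder contains non-empty
# genome.fa and transcripts.gtf before running IRFinder -m BuildRefProcess.
check_ref_inputs() {
    local ref_dir="$1" f status=0
    for f in genome.fa transcripts.gtf; do
        # -s is true only if the file exists and has a size greater than zero.
        if [ ! -s "${ref_dir}/${f}" ]; then
            echo "missing or empty: ${ref_dir}/${f}" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (folder name taken from the workaround above):
# check_ref_inputs REF/Zebrafish-GRCz11-release111 && echo "inputs look ready"
```

The check reports every missing file rather than stopping at the first one, so a single run shows everything that still needs downloading or renaming.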