williamritchie / IRFinder

Detecting intron retention from RNA-Seq experiments
53 stars 25 forks source link

Problem with Reference Building of Drosophila gtf file #157

Closed kethselly closed 2 years ago

kethselly commented 2 years ago

Hello, I've been attempting to build a reference for IRFinder. I'm using the BuildRef mode with the code below to build a reference for Drosophila melanogaster. I'm using MacOSX 11.6 (Big Sur), perl 5.32.1, samtools ver 1.14, bedtools ver 2.30. I think the version of gcc I have is older (4.2.1), but I wasn't trying to compile the software from source so I thought I would be OK without the update. I'm just wondering whether my laptop is perhaps not powerful enough for running this? I already mapped the reads using STAR on galaxy and was going to use the downloaded .bam files (-m BAM) in the future in combination with the IRFinder reference but I've had trouble getting the reference to build.

The code I was using is: IRFinder -m BuildRef -r /IRFinder/test3 -j 74 -t 8 ftp://ftp.ensembl.org/pub/release-105/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.32.105.gtf.gz Originally, I was getting an error about a command called readlink, but I used homebrew to install coreutils and then edited the IRFinder script to use greadlink instead and that seemed to work.

However, now when I run the script I'm still getting an error as shown below:

<Phase 1: STAR Reference Preparation> STAR --runMode genomeGenerate --genomeDir /IRFinder/test3/STAR --genomeFastaFiles /IRFinder/test3/genome.fa --sjdbGTFfile /IRFinder/test3/transcripts.gtf --sjdbOverhang 74 --runThreadN 8 &>> log-star-build-ref.log STAR version: 2.7.9a compiled: :/Users/cshl/data/STAR/STAR/source

Dec 12 23:52:32 ..... started STAR run Dec 12 23:52:32 ... starting to generate Genome files Dec 12 23:52:35 ..... processing annotations GTF !!!!! WARNING: --genomeSAindexNbases 14 is too large for the genome size=143726002, which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases 12 Dec 12 23:52:39 ... starting to sort Suffix Array. This may take a long time... Dec 12 23:52:40 ... sorting Suffix Array chunks and saving them to disk... Dec 12 23:53:15 ... loading chunks from disk, packing SA... Dec 12 23:53:18 ... finished generating suffix array Dec 12 23:53:18 ... generating Suffix Array index Dec 12 23:53:52 ... completed Suffix Array index Dec 12 23:53:52 ..... inserting junctions into the genome indices Dec 12 23:54:14 ... writing Genome to disk ... Dec 12 23:54:14 ... writing Suffix Array to disk ... Dec 12 23:54:15 ... writing SAindex to disk Dec 12 23:54:15 ..... finished successfully <Phase 2: Mapability Calculation> Dec 12 23:54:15 ... mapping genome fragments back to genome... Dec 12 23:55:29 ... sorting aligned genome fragments... [bam_sort_core] merging from 0 files and 8 in-memory blocks... Dec 12 23:55:33 ... indexing aligned genome fragments... Dec 12 23:55:34 ... filtering aligned genome fragments by chromosome/scaffold... xargs: illegal option -- - usage: xargs [-0opt] [-E eofstr] [-I replstr [-R replacements] [-S replsize]] [-J replstr] [-L number] [-n number [-x]] [-P maxprocs] [-s size] [utility [argument ...]] Mapability build: Failed!

Thanks so much for any help you all can give! ~Seth

dg520 commented 2 years ago

@kethselly xargs in MACOS behaves differently than that it does in Linux, particularly some arguments supported in Linux are not supported in MAC. This leads to your error.

The root of this error sits in Lines 50-51 of IRFinder/bin/util/Mapability. The application of xargs there is supposed to process each chromosome name read from chrName.txt file (in the STAR reference folder that has already been generated under the IRFinder reference folder assigned by your command) and parse it to the samtools view command specified in Line 51.

I'm not an expert of MAC, but the workaround could be using a for loop to read chromosome names in chrName.txt and feed them to samtools view one by one, instead of using xargs.

In addition, there is another xargs implementation in Lines 55-59, You might also need to replace that one with a for loop as well. Specifically, you need to divide the Lines 56 and 58 into two parts: 1) for the find "$TMPBED" -type f -name "*.bed.""$TMPEXT" part, you need to save the result to a temporary file. 2) for the xargs part, you need to use a for loop to read each line from your temporary file and parse it to either zcat or lzop.

IRFinder doesn't support MACOS currently, but I hope these tricks help you.

P.S. The advantage of using xargs is that it takes advantage of multi-threading, which is not possible by a default for loop. Therefore, using for loop could result in folds more time to get the job done. Fortunately, we're looping on the number of chromosomes here and I believe that's a handleable number in drosophila.

Best, Dadi

kethselly commented 2 years ago

Dadi,

Thanks so much for your reply and helping me figure this out. That makes sense. Before I attempt the for loop idea, I was wondering whether you thought something like homebrew findutils would work.

https://formulae.brew.sh/formula/findutils

and also here:

https://superuser.com/questions/467176/replacement-for-xargs-d-in-osx

There is a command in findutils that looks like it mimics xargs (called gxargs). I was thinking that this might work similarly to the replacement of readlink with greadlink that I mentioned in my original post. I'm just wondering whether I might be able to install this utility and then replace xargs in the Mapability script with gxargs and see if it works. Do you have any thoughts on this? I apologize for all the questions - I'm somewhat new to the command line.

Thanks again for all the help! ~Seth

dg520 commented 2 years ago

@kethselly Installing a GNU version of xargs sounds promising for me, as the original problem was due to the incompatibility between GNU and BSD systems. After installation, you just have to change xargs everywhere in the script to gxargs. If nothing raised to an error, I would think the problem solved.

kethselly commented 2 years ago

@dg520 Thanks! Downloading homebrew findutils and replacing xargs with gxargs seem to help get things a little further, but now there is a long list of other errors that seem to probably have the same source. The general format of the error is shown below (at the bottom of the code chunk), but the xxxxx.bed.gz number keeps changing:

<Phase 2: Mapability Calculation>
Dec 16 20:26:16 ... mapping genome fragments back to genome...
Dec 16 20:27:27 ... sorting aligned genome fragments...
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
Dec 16 20:27:31 ... indexing aligned genome fragments...
Dec 16 20:27:32 ... filtering aligned genome fragments by chromosome/scaffold...
gxargs: warning: options --max-args and --replace/-I/-i are mutually exclusive, ignoring previous --max-args value
Dec 16 20:27:51 ... merging filtered genome fragments...
zcat: can't stat: tmp_89407/2R2_mapped_Scaffold_56_D1828.bed.gz (tmp_89407/2R2_mapped_Scaffold_56_D1828.bed.gz.Z): No such file or directory
zcat: can't stat: tmp_89407/211000022280526.bed.gz (tmp_89407/211000022280526.bed.gz.Z): No such file or directory
zcat: can't stat: tmp_89407/211000022279839.bed.gz (tmp_89407/211000022279839.bed.gz.Z): No such file or directory

.... the long list ends with:

zcat: can't stat: tmp_89407/211000022279121.bed.gz (tmp_89407/211000022279121.bed.gz.Z): No such file or directory
Mapability build: Failed!

I'm wondering whether this now has to do with a long list of scaffolds in the most recent Drosophila .fa genome and .gtf file that as far as I understand them, are repetitive sequences that are not able to be assembled into the overall genome assembly. I'm thinking of deleting these extra sequences from the .fasta file and then using the -m BuildRefProcess option rather than the -m BuildRef option (so I don't have to keep redownloading the same file.

Is there something else you would recommend trying first or instead though?

Thanks so much for your help on this.

~Seth

kethselly commented 2 years ago

@dg520 Another update: I tried removing those extra scaffolds and that didn't seem to really help with the errors. I kept digging into this and I think the issue was still with xargs. After looking around online I found this post (https://superuser.com/questions/624462/xargs-replace-i-for-single-arguments) which suggested using an extra pipe. So I tried that as follows:

cat "$STARGENOME/chrName.txt" | \
    gxargs --max-args 1 | gxargs --max-procs "$THREADS" -I{} bash -c "samtools view genome_fragments.bam {}|awk -v tmpdir=\"$TMPBED\" -v tmpcmp=\"$TMPCMP\" -v tmpext=\"$TMPEXT\" 'BEGIN{FS=\"[\\t!]\"; OFS=\"\\t\"}{if ((\$8 == \"70M\") && (\$3 == \$6) && (\$2 == \$5)) {print \$5, \$6-1, \$6+69 | (tmpcmp \" -c1 > \" tmpdir \"/\" \$5 \".bed.\" tmpext ) }}END{close( (tmpcmp \" -c1 > \" tmpdir \"/\" \$5 \".bed.\" tmpext ))}'"

This helped, but then I got another error message that seemed to be from trying to use zcat. The coreutils package I had installed earlier from homebrew also contains a similar command called gzcat, so I altered this code to the following:

if [ "$TMPEXT" == "gz" ]; then
    find "$TMPBED" -type f -name "*.bed.""$TMPEXT"|gxargs --max-args 1 gzcat >> genome_fragments.unsorted.bed
elif [ "$TMPEXT" == "lzo" ]; then
    find "$TMPBED" -type f -name "*.bed.""$TMPEXT"|gxargs --max-args 1 lzop -cdf >> genome_fragments.unsorted.bed
fi

Again - I here also replaced xargs with gxargs too.

This seems to work. I ran the following command: IRFinder -m BuildRefProcess -r /IRFinder/test2 -j 74 -t 8

and then received the following output:

`<Phase 2: Mapability Calculation>
Dec 16 23:25:34 ... mapping genome fragments back to genome...
Dec 16 23:26:39 ... sorting aligned genome fragments...
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
Dec 16 23:26:44 ... indexing aligned genome fragments...
Dec 16 23:26:44 ... filtering aligned genome fragments by chromosome/scaffold...
Dec 16 23:27:02 ... merging filtered genome fragments...
Dec 16 23:27:03 ... calculating regions for exclusion...
Dec 16 23:27:10 ... cleaning temporary files...
<Phase 3: IRFinder Reference Preparation>
Dec 16 23:27:10 ... building Ref 1...
* Input error: Chromosome Unmapped_Scaffold_4_D1555_D1692 doesn't present in the .genome file. *
Dec 16 23:27:13 ... building Ref 2...
Dec 16 23:27:13 ... building Ref 3...
Dec 16 23:27:13 ... building Ref 4...
Dec 16 23:27:14 ... building Ref 5...
* Input error: Chromosome Unmapped_Scaffold_8_D1580_D1567 doesn't present in the .genome file. *
Dec 16 23:27:16 ... building Ref 6...
Dec 16 23:27:16 ... building Ref 7...
Dec 16 23:27:17 ... building Ref 8...
* Input error: Chromosome Unmapped_Scaffold_8_D1580_D1567 doesn't present in the .genome file. *
Dec 16 23:27:17 ... building Ref 9...
Dec 16 23:27:17 ... building Ref 10c...
Dec 16 23:27:17 ... building Ref 11c...`

So - does this mean the reference has been built correctly?

Thanks again!

dg520 commented 2 years ago

@kethselly This is a fantastic insight into the BSD-based solution! I apologize for a late reply. I do believe the reference building is successful now.

kethselly commented 2 years ago

Great! Thanks!