segmentation fault; recompilation doesn't fix

williamritchie / IRFinder

Detecting intron retention from RNA-Seq experiments

53 stars 25 forks source link

segmentation fault; recompilation doesn't fix #80

Closed glarue closed 4 years ago

glarue commented 4 years ago

Running on a single (SE) FASTQ file as in the manual fails (it fails the same, as far as I can tell, on a pre-built BAM):

roylab@trypsin /mnt/trypsin_extra/no_backup/glarue/physarum/physarum_extra/intron_retention $ IRFinder -r REF/ref-stp-aug-gffcmp -S ~/.installed_programs/STAR-2.7.3a/bin/Linux_x86_64/STAR -t 6 -d ERR557103_irf ./rna_seq/ERR557103.fastq.gz
WARNING: Novosort license has expired (in file /home/roylab/.installed_programs/novocraft/novoalign.lic), not sorting output.
/home/roylab/.installed_programs/IRFinder-1.2.6_hotfix_R_bug/bin/IRFinder: line 562: 21957 Broken pipe             "$STAREXEC" --genomeLoad $STARMEMORYMODE --runThreadN $THREADS --genomeDir "$REF/STAR" --outFilterMultimapNmax 1 --outSAMstrandField intronMotif --outFileNamePrefix "${OUTPUTDIR}/" --outSAMunmapped None --outSAMmode NoQS --outSAMtype BAM Unsorted --outStd BAM_Unsorted --readFilesIn "$1" $EXTRAREADFILESCOMMAND $EXTRAADAPT
     21958                       | tee "$OUTPUTDIR/Unsorted.bam"
     21959                       | gzip -cd
     21960 Segmentation fault      | "$LIBEXEC/irfinder" "$OUTPUTDIR" "$REF/IRFinder/ref-cover.bed" "$REF/IRFinder/ref-sj.ref" "$REF/IRFinder/ref-read-continues.ref" "$REF/IRFinder/ref-ROI.bed" "$OUTPUTDIR/unsorted.frag.bam" >> "$OUTPUTDIR/irfinder.stdout" 2>> "$OUTPUTDIR/irfinder.stderr"
ERROR: IRFinder appears not to have completed. It appears an unknown component crashed.
ERROR: IRFinder appears not to have completed. It appears an unknown component crashed.
ERROR: IRFinder appears not to have completed. It appears an unknown component crashed.

GCC and GLIBC are newer:

gcc (Debian 8.3.0-6) 8.3.0
ldd (Debian GLIBC 2.28-10) 2.28

Running the troubleshooting step of recompiling irfinder and then running a test results in the following:

roylab@trypsin ~/.installed_programs/IRFinder-1.2.6_hotfix_R_bug.broken/bin/util $ gzip -cd /mnt/trypsin_extra/no_backup/glarue/physarum/physarum_extra/intron_retention/DRR047256.bam | ./irfinder test $IRREF/IRFinder/ref-cover.bed $IRREF/IRFinder/ref-sj.ref $IRREF/IRFinder/ref-read-continues.ref $IRREF/IRFinder/ref-ROI.bed test/unsorted.frag.bam >> test/irfinder.stdout 2>> test/irfinder.stderr
Segmentation fault

Looking at the IRFinder error:

roylab@trypsin ~/.installed_programs/IRFinder-1.2.6_hotfix_R_bug.broken/bin/util $ cat test/irfinder.stderr 
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoul

In case it matters, these annotations are for a non-model organism and have been constructed in-house, which means that there may be some weird introns called in the annotations. Any help is greatly appreciated.

dg520 commented 4 years ago

Hi @glarue ,

Really appreciate these in-depth self troubleshooting steps. The problem is a weird one, namely stoul function is missing in your C++ libraries. stoul is a function comes with basic C++ 11. Another user using MAC OS had the same problem as yours. Which platform are you using? Is it a pure debian? Are you using C++ 11 or C++ 98?

Best, Dadi

glarue commented 4 years ago

@dg520 thank so much for the quick response.

I am using Deban 9, which I believe has C++ 11 compatibility built-in.

Strangely, I tried recompiling irfinder again and the same test command still produces a segfault, but now there is now no content in test/irfinder.stderr at all.

I am wondering whether this might be somewhat complicated by the fact that I believe I have a newer version of gcc than the default Debian 9 install (pulled from Debian 10 repos), although the compilation itself doesn't throw any errors.

glarue commented 4 years ago

Reading through the troubleshooting more, and while none of the BuildReference output files were empty, the number of scaffolds in my genome greatly exceeds the default ulimit -n value. I've changed the ulimit -n value and am re-running the reference building step to see whether that could fix things even though the ref. build step completed successfully before changing that value.

dg520 commented 4 years ago

Hi @glarue ，

If the building step was successful (and you are using the latest version of IRFinder), the downstream failure should not be derived from reference. Please note, old versions don't report reference failure in an obvious way (it continuously running till the end but returns a signal digit to indicate error ) and users might miss it.

But back to your question, I think the main problem is that you're missing the stoul funcion. Could you please try the following C++ code:

#include <cstring>
#include <time.h>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
#include <limits>
#include <sys/types.h>
#include <vector>
#include <map>
#include <algorithm> // std::sort
#include <functional> // std::function

using namespace std;

int main()
{
    string str = "255";
    unsigned long num = stoul(str);
    cout << num << "\n";

    return 0;
}

Save the above code in a new file and name it as test.cpp. Then try to compile it using

g++ -o test test.cpp

I think you might encounter an error indicating 'stoul' was not declared. Let me know.

Best, Dadi

glarue commented 4 years ago

@dg520 unfortunately (or fortunately?) that compiles successfully using that command.

For reasons I don't understand, I am no longer able to get IRFinder to generate an error log mentioning stoul—running the suggested test within /util produces only Segmentation fault with no error log:

roylab@trypsin ~/.installed_programs/IRFinder-1.2.6_hotfix_R_bug/bin/util $ echo $IRREF
/mnt/trypsin_extra/no_backup/glarue//physarum/physarum_extra/intron_retention/REF/ref-stp-aug-gffcmp
roylab@trypsin ~/.installed_programs/IRFinder-1.2.6_hotfix_R_bug/bin/util $ gzip -cd /mnt/trypsin_extra/no_backup/glarue/physarum/physarum_extra/intron_retention/ERR557103.byName.bam | ./irfinder test $IRREF/IRFinder/ref-cover.bed $IRREF/IRFinder/ref-sj.ref $IRREF/IRFinder/ref-read-continues.ref $IRREF/IRFinder/ref-ROI.bed test/unsorted.frag.bam >> test/irfinder.stdout 2>> test/irfinder.stderr
Segmentation fault
roylab@trypsin ~/.installed_programs/IRFinder-1.2.6_hotfix_R_bug/bin/util $ cat test/irfinder.stderr 
roylab@trypsin ~/.installed_programs/IRFinder-1.2.6_hotfix_R_bug/bin/util $

The behavior is the same for the pre-compiled irfinder and the self-compiled version.

dg520 commented 4 years ago

Hi @glarue ,

Before I running out of idea, could you please re-download the binary bin/util/irfinder from GitHub and over-write your current re-compiled one? Then please call the irfinder command to check whether it generates the stadard error mentioning stoul. If so, it might suggest your compiling might overcome the stoul error and bring up something else.

And during your re-compilation, is there anything suspicious?

I know I might ask for too much here, but would you mind also test the IRFinder (before and after your re-compiling) on the Ensembl human genome example below? If they regenerated the same error, it will rule out your GTF incompatibility.

#Build human reference
$ bin/IRFinder -m BuildRef -r REF/Human-hg19-release75 \      
    -e REF/extra-input-files/RNA.SpikeIn.ERCC.fasta.gz \      
    -b REF/extra-input-files/Human_hg19_wgEncodeDacMapabilityConsensusExcludable.bed.gz \      
    -R REF/extra-input-files/Human_hg19_nonPolyA_ROI.bed \      
    ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

#Get an example human FASTAQ data 
#You can use your local human reads as well
#and actually you can choose to only use the first 10000 lines of each read file 
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR900/SRR900288/SRR900288_1.fastq.gz     
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR900/SRR900288/SRR900288_2.fastq.gz

#Test
$ bin/IRFinder -r REF/Human-hg19-release75 -d test  SRR900288_1.fastq SRR900288_2.fastq

Best, Dadi

glarue commented 4 years ago

I know I might ask for too much here, but would you mind also test the IRFinder (before and after your re-compiling) on the Ensembl human genome example below?

Please, that's the least I can do—I sincerely appreciate your time on this.

I am running the test and will update, but I also did have a preliminary run done on a newer human genome assembly/annotation that worked fine before I encountered the error with the other genome, so I suspect something must be going on with this annotation.

I had to convert from GFF3 --> GTF, and manually added the gene_biotype and transcript_biotype entries. Aside from those tags, are there any other obvious gotchas I should be looking out for in the annotation file?

glarue commented 4 years ago

@dg520 update: the human genome test worked fine, which heavily implies it's something funky with my other non-human annotation.

See my comment above, but any suggestions for what to check in the annotation would be really helpful. As I mentioned before, the reference-building step seems to complete without error. After converting from GFF3 --> GTF, I added the appropriate gene/transcript tags to each gene/transcript/exon/CDS feature, e.g.

Scaffold0       transdecoder    exon    86600   86685   0       +       .       gene_id "XLOC_000006"; transcript_id "TCONS_00000005.p1"; Name "TCONS_00000005.p1"; gene_biotype "protein_coding"; transcript_biotype "processed_transcript";
Scaffold0       transdecoder    CDS     86600   86682   0       +       2       gene_id "XLOC_000006"; transcript_id "TCONS_00000005.p1"; Name "TCONS_00000005.p1"; gene_biotype "protein_coding"; transcript_biotype "processed_transcript";

Thanks again for your time in helping to troubleshoot this.

dg520 commented 4 years ago

Hi @glarue ,

Sorry for a late reply. For a GTF file, IRFinder explicitly looks for gene_name, gene_id, transcript_id, transcript_biotype and gene_biotype in every row marked as exon (the third column). For a GFF version 2 file, IRFinder explicitly looks for ID, gene_id, transcript_id, transcript_biotype and gene_biotype in every row marked as exon (the third column). For a GFF version 3 file, IRFinder explicitly looks for Name, gene_id, transcript_id, transcript_biotype and gene_biotype in every row marked as exon (the third column).

I might miss something else, but the entire expected set of attributes is defined in bin/util/gtf2bed-custom.pl

Your input looks like a GFF3 file instead of GTF. You might either change Name to gene_name (to make it a GTF) or change your file name from X.gtf to X.gff3.

I hope this is helpful.

Best, Dadi

glarue commented 4 years ago

Thanks again for your help @dg520, this definitely seems like it's probably an annotation issue (the number of times I've said that over the years!). I've tried changing Name --> gene_name, but the error remains. I did, however, notice upon re-running the reference building step that there seems to be an error thrown during that process, though the ref building itself doesn't fail:

Star genome build result: 0
Commence STAR mapping run for mapability.
Thu Apr  9 20:39:00 PDT 2020
awk: cmd. line:1: (FILENAME=- FNR=11705312) fatal: can't open pipe `gzip -c1 > tmp_by_chr_13878/Scaffold1832.bed.gz' for output (Resource temporarily unavailable)

real    13m41.192s
user    72m37.012s
sys     4m59.800s
Completed STAR run.

I will spend some time looking through the gtf2bed-custom.pl parsing and see if I can figure out anything else in the annotations that might be causing issues.

For what it's worth, the GFF3 --> GTF conversion was done using The Broad's TransDecoder pipeline (though in my experience the degree to which any given script actually "converts" between annotation formats is highly variable).

Out of curiosity (as I've spent quite a bit of time parsing various annotation formats), is gtf2bed-custom.pl perhaps built more around Ensembl-specific format expectations than the somewhat-loosely-defined specs themselves?

dg520 commented 4 years ago

Hi @glarue ,

This error message makes sense now. It's very likely because you have too many scaffolds in your reference. IRFinder indexed that many of temporary output file, however, some of them were not created due to the disk/memory limitation. How many threads are you using to run IRFinder, and how many memory is available to you when you start mapability stage?

In an extreme situation, you might want to try to change Line 32 of bin/util/Mapability from

--runThreadN $THREADS --outStd SAM --outSAMmode NoQS \

--runThreadN 1 --outStd SAM --outSAMmode NoQS \

This will force the mapability calculation stage using only one thread and thus should significantly reduce the memory usage. But this stage might take much longer. And make sure you have enough disk space as well.

For your GTF conversion question: gtf2bed-custom.pl coming with IRFinder is adapted from Google group. We made some changes to be compatible with IRFinder expectation. Both GTF and GFF format indeed have a quite clear definition now, which you can check on Ensembl site. The problem is more on the computational side for 2 reasons:

Most GFF<-->GTF conversion tools were built in early days and were not updated to the new definitions.
Most tools were only looked for specific features during convention, instead of converting all features. For example, some tools only looked for gene_id but inherited other attributes from the input directly. From a programmer point of view, this is a safe approach. It's challenging to convert everything, as not every attribute in GFF has a corresponding partner in GTF, and vice versa. Forcing to convert everyting would be potentially buggy in practice.

glarue commented 4 years ago

@dg520 thank you very much for the suggestions—I will look into fixing the memory allocation issue. I'm running on a small server with ~140 GB RAM, so I will try adjusting the threads to see if I can get it to run without errors.

Both GTF and GFF format indeed have a quite clear definition now, which you can check on Ensembl site. The problem is more on the computational side for 2 reasons:

Most GFF<-->GTF conversion tools were built in early days and were not updated to the new definitions.

Most tools were only looked for specific features during convention, instead of converting all features. For example, some tools only looked for gene_id but inherited other attributes from the input directly. From a programmer point of view, this is a safe approach. It's challenging to convert everything, as not every attribute in GFF has a corresponding partner in GTF, and vice versa. Forcing to convert everyting would be potentially buggy in practice.

The GFF3 spec is fairly thorough, but I'm not aware of any spec that consistently defines valid tag names for GTF/GFF2 (e.g. gene_name, etc). For example, this page (and some others) suggest that gene_id and transcript_id are required, but the degree to which even Ensembl annotations enforce this is variable (e.g. https://www.biostars.org/p/140034/). I suppose the takeaway is, as is often the case, that annotations are messy. :)

Thanks again for your continued help—I will update after investigating the reference-building issue.

glarue commented 4 years ago

This will force the mapability calculation stage using only one thread and thus should significantly reduce the memory usage. But this stage might take much longer. And make sure you have enough disk space as well.

Unfortunately, running with --runThreadN 1 in Mapability did not solve the issue:

sh: 1: Cannot fork
awk: cmd. line:1: (FILENAME=- FNR=11778344) fatal: print to "gzip -c1 > tmp_by_chr_27195/Scaffold1960.bed.gz" failed (Broken pipe)

Memory usage doesn't seem to have been the issue, as total usage never got above ~15%. Is the storage being used local to the run, or is it attempting to use some default temporary folder elsewhere? The drive where the run is occurring has ~800 GB of free space. Is it possible the first part of the error message regarding sh could be related to the second?

EDIT: I see that the temp directories are made locally, so it shouldn't be a space issue. Does it try to fork all processes in parallel? The genome file has 71231 scaffolds.

dg520 commented 4 years ago

Hi @glarue ,

Now it's a different error. The previous one when you used multiple-threading said resources not available. Now it says Broken pipe.

What the command does is to process each alignment, judge whether if it's a perfect match and then write it into a gzipped temporary file if so. It is designed to work with STAR so that each read is mapped and processed at the same time. In another word, it should fork as many as the number of threads you use at the same time. It's very weird you cannot fork with thread = 1.

I think it might be that the reads for STAR is not generated probably. Please bear with me and try the following:

Try to generate reads for STAR from your genome:
```
bin/util/generateReadsError.pl 70 10 < YOUR_GENOME.fa > test.fq
```
Can you successfully generate the test.fq without error?

Map reads and fork the result manually

STAR \
--genomeDir IRFINDER_REFERENCE_FOLDER/STAR \
--genomeLoad NoSharedMemory \
--runThreadN 1 --outStd SAM --outSAMmode NoQS \
--outSAMattributes None \
--outFilterMultimapNmax 1 \
--readFilesIn <("$LIBEXEC/generateReadsError.pl" 70 10 < "$FA") \
| \
awk  'BEGIN {FS="[\t!]"; OFS="\t"} (($8 == "70M") && ($3 == $6) && ($2 == $5)) {print $5, $6-1, $6+69 | ("gzip  -c1 >  "$5".bed.gz") }'

Can you successfully generate multiple bed.gz files? Each of them should start with your scaffold ID.

UPDATE: If you can successfully achieve the above, I really don't know why the fork process failed...

glarue commented 4 years ago

Now it's a different error. The previous one when you used multiple-threading said resources not available. Now it says Broken pipe.

Right you are—apologies, I've been teaching (remotely) all day and multitasking isn't an ideal paradigm for troubleshooting.

The first step you suggested appears to work, but the second fails with the same broken pipe: sh: 1: Cannot fork awk: cmd. line:1: (FILENAME=- FNR=14235477) fatal: print to "gzip -c1 > Scaffold5380.bed.gz" failed (Broken pipe)

It does, however, succeed in making many thousands of bed.gz files (5380 of them).

Oy vey.

dg520 commented 4 years ago

Hi @glarue ,

I think we're getting there.

How many scaffolds are there in your reference genome? Is it 5380? When the pipe broke, how many total files exist in the directory (including all Scaffold*.bed.gz and other files)?

If you remove Scaffold5380 and rerun the above two commands, will the second command fail at Scaffold5379? If yes, it might be file number limitation according to your disk. could you please do a -ulimit -a and attach here the entire output? If the two commands can be run, the problem might be in Scaffold5380.

EDIT: I mean remove Scaffold5380 from genome.fa file.

glarue commented 4 years ago

@dg520 okay so, the scaffold numbers are 0-indexed, and it was breaking during the creation of Scaffold5380.bed.gz previously. There are 71231 scaffolds in the genome.

I removed Scaffold5380 from the genome and re-ran the command; I was expecting it to break on 5381, but strangely it now breaks on 5382 instead, creating the 0-byte file Scaffold5382.bed.gz before quitting:

$ ~/.installed_programs/STAR-2.7.3a/bin/Linux_x86_64/STAR --genomeDir STAR --genomeLoad NoSharedMemory --runThreadN 1 --outStd SAM --outSAMmode NoQS --outSAMattributes None --outFilterMultimapNmax 1 --readFilesIn <(~/.installed_programs/IRFinder-1.2.6_hotfix_R_bug/bin/util/generateReadsError.pl 70 10 < genome.fa) | awk  'BEGIN {FS="[\t!]"; OFS="\t"} (($8 == "70M") && ($3 == $6) && ($2 == $5)) {print $5, $6-1, $6+69 | ("gzip  -c1 >  "$5".bed.gz") }' &> log.txt
$ cat log.txt 
sh: 1: Cannot fork
awk: cmd. line:1: (FILENAME=- FNR=14235896) fatal: print to "gzip  -c1 >  Scaffold5382.bed.gz" failed (Broken pipe)

Here is the ulimit info (note that I had prophylactically upped the ulimit -n value during my troubleshooting leading up to my first issue post):

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 572136
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 71680
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 572136
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

dg520 commented 4 years ago

Hi @glarue ,

That's a lot of scaffolds! Anyway, your results suggest your genome fasta is fine and your system is capable to write out all chromsomes. The bottleneck seems to be the standard input/output buffer size. The workaround is to split your genome into two halves and generate mapability respectively.

Try to replace the following lines in bin/util/Mapability:

time "$STAREXEC" \
--genomeDir "$STARGENOME" \
--genomeLoad NoSharedMemory \
--runThreadN $THREADS --outStd SAM --outSAMmode NoQS \
--outSAMattributes None \
--outFilterMultimapNmax 1 \
--readFilesIn <("$LIBEXEC/generateReadsError.pl" 70 10 < "$FA") \
| \
awk -v tmpdir="$TMPCHR" -v tmpcmp="$TMPCMP" -v tmpext="$TMPEXT"  'BEGIN {FS="[\t!]"; OFS="\t"} (($8 == "70M") && ($3 == $6) && ($2 == $5)) {print $5, $6-1, $6+69 | ( tmpcmp " -c1 > " tmpdir "/" $5 ".bed." tmpext ) }'

with

("$LIBEXEC/generateReadsError.pl" 70 10 < "$FA") > genome_fragments.fq

echo Genome fragments generated.

time "$STAREXEC" \
--genomeDir "$STARGENOME" \
--genomeLoad NoSharedMemory \
--runThreadN $THREADS --outStd SAM --outSAMmode NoQS \
--outSAMattributes None \
--outFilterMultimapNmax 1 \
--readFilesIn genome_fragments.fq > genome_fragments.sam

echo Genome fragments aligned.

samtools view genome_fragments.sam \
| \
awk -v tmpdir="$TMPCHR" -v tmpcmp="$TMPCMP" -v tmpext="$TMPEXT"  'BEGIN {FS="[\t!]"; OFS="\t"}{gsub("Scaffold","",$5); gsub("Scaffold","",$2); if (($8 == "70M") && ($3 == $6) && ($2 == $5) && ($5 > 40000)) {print "Scaffold"$5, $6-1, $6+69 | ( tmpcmp " -c1 > " tmpdir "/Scaffold" $5 ".bed." tmpext ) }}'

echo Complete mapping fragments on Scaffold40001 - Scaffold71231.

samtools view genome_fragments.sam \
| \
awk -v tmpdir="$TMPCHR" -v tmpcmp="$TMPCMP" -v tmpext="$TMPEXT"  'BEGIN {FS="[\t!]"; OFS="\t"}{gsub("chr","",$5); gsub("chr","",$2); if (($8 == "70M") && ($3 == $6) && ($2 == $5) && ($5 <= 40000)) {print "Scaffold"$5, $6-1, $6+69 | ( tmpcmp " -c1 > " tmpdir "/Scaffold" $5 ".bed." tmpext ) }}'

echo Complete mapping fragments on Scaffold0 - Scaffold40000.

If you can see

Complete mapping fragments on Scaffold40001 - Scaffold71231.
Complete mapping fragments on Scaffold0 - Scaffold40000.

in the standard output, we've passed Step 1 of mapability calculation where you used to fail at. We might, or might not, have the same problem in Step 2. Let's see.

glarue commented 4 years ago

That's a lot of scaffolds!

Damned non-model organisms!

So, the two messages print, but we've still got a broken pipe issue:

Genome fragments aligned.
sh: 1: Cannot fork
awk: cmd. line:1: (FILENAME=- FNR=14130675) fatal: print to "gzip -c1 > tmp_by_chr_32195/Scaffold5300.bed.gz" failed (Broken pipe)
Complete mapping fragments on Scaffold40001 - Scaffold71231.
Complete mapping fragments on Scaffold0 - Scaffold40000.
Completed STAR run.
Sat Apr 11 15:52:44 PDT 2020
Commence Coverage calculation.

gzip: stdin: unexpected end of file

real    36m43.714s
user    32m8.916s
sys     11m59.548s

real    0m0.211s
user    0m0.148s
sys     0m0.072s
Completed coverage exclusion calculation.
Sat Apr 11 16:29:29 PDT 2020
Mapability result: 0
Build Ref 1
Build Ref 2
Build Ref 3
Build Ref 4
Build Ref 5
Build Ref 6
Build Ref 7
Build Ref 8
Build Ref 9
Build Ref 10c
Build Ref 11c
COMPLETE
Ref build result: 0
ALL DONE

dg520 commented 4 years ago

Hi @glarue ,

Do you still keep genome_fragments.sam? Let's do an experiment on it. Let's sort the SAM file first.

samtools sort genome_fragments.sam > genome_fragments.bam
samtools index genome_fragments.bam

Then please copy and run the following chunk of codes in your terminal:

i=1; \
time while [ $i -le 71231 ]; \
do \
samtools view genome_fragments.bam "Scaffold"$i > "Scaffold"$i.txt; \
cat "Scaffold"$i.txt|awk 'BEGIN{FS="[\t!]"; OFS="\t"}{if (($8 == "70M") && ($3 == $6) && ($2 == $5)) {print $5, $6-1, $6+69 | ("gzip -c1 >" $5 ".bed.gz") }}END{close( ("gzip -c1 >" $5 ".bed.gz"))}'; \
rm "Scaffold"$i.txt; \
((i++)); \
done

This will restrictly process one scaffold a time: read in only one scaffold, write to only one output and more importantly, close the output when EOF is encountered at the end of that scaffold records. Technically, this forces to only open and write one file at a time, thus the process can be slow. I hope it will generate all 71231 output without error for you.

P.S.: you might want to change 71231 to a smaller number, saying 1000 first, to see how long it will take to run the first 1000 scaffolds. From there you can estimate the time needed for running through the entire genome and let me know if that amount of time is feasible for you. BUT PLEASE NOTE, whenever you change the number, do make sure there is a space between the number and the following half bracket "]".

glarue commented 4 years ago

From there you can estimate the time needed for running through the entire genome and let me know if that amount of time is feasible for you.

Thank you for your continued assistance on this; should be done by ~3 pm PST this afternoon. I will update this post thereafter.

@dg520 okay, so the loop finished and produced 68187 .bed.gz files:

$ ls Scaffold*.bed.gz | wc -l
68187

While this doesn't represent the total number of scaffolds, it's close to the number of scaffolds with mapped reads in genome_fragments.sam:

$ grep -Pv '^@' genome_fragments.sam | awk '{print $3}' | grep 'Scaff' | sort -u | wc -l
68191

Looking at those scaffolds that were in genome_fragments.sam but aren't included in the bed.gz files, they are short and a couple of them are very repetitive, which may be why?

$ comm -23 fragment_scaffolds.txt processed_scaffolds.txt
Scaffold38490
Scaffold45502
Scaffold62832
Scaffold70615
$ for s in $(comm -23 fragment_scaffolds.txt processed_scaffolds.txt); do faseq ../genome.fa $s -u; done
>Scaffold38490
GTTTCATTATATGCAGCAACTAATACCTCACTCCACACCTGGCACCAGAGATTGGGCCAT
ATAGGTATTGACCGGTTAAAGGAGATGAGAAGAAATAAAGGTGCTACAGGTGTAGATTTT
CCAGATTCCGATATCCCCAATTTCTCCTGTGAAGTTTGCATCCTAGGCAAAGCACACAGG
CATGCCATCCTGAACAAGGAGAGAGAACGCGCTACCTACAAGGGAGAGCGGTTCCACGCT
GACACTTGCGGTCCTATGCAGGTGCCATCAGTAGGGGGGAAGAGATATCTAGTGATGCTT
GTGGACTGCTATAGTCGCAAGAAGTTCACCATACTCTCCAAAGACAAGAAGGAGATCAAG
GAAGGCGTACTCAATTTCATTGAAGG
>Scaffold45502
TATTTCTATATTTCTATTTCTATTTCTATTTCTATTTCTATTTCTATTTCTATTTCTATT
TCTATTTCTATATTTCTATATTTCTATTTCTATTTCTATTTCTATTTCTATTTCTATTTC
TATTTCTATTTCTATTTCTATTTCTATTTCTATTTCTATATTTCTATTTCTATTTCTATT
TCTATTTCTATTTCTATTTCTATTTCTATTTCTATTTCTATTTCTATTTCTATTTCTATT
TCTATTTCTATTTCTATTTCTATTTCTATTTCTATTTCTATTTCTATTTCTATTTCTATT
TCTATTTCTATTTCTATTTCTATATTTTT
>Scaffold62832
ATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATT
ATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATT
ATTATTAGTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATT
ATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATT
>Scaffold70615
GGGCTGCCAGTGAGATCCTGGACTTTAGAGAGAAAAGGTGCCAGTCAAATACTGGTGGTG
GTGGACCCAATTTGCCAATCCAGAAACTCCAAACCAAAATCATGACTAATGGGAAGGTGA
TGTGTTGAGATTGCAAAGAAGGTGTAACTTGATTTGCCTGCTCCCTCTGTTGTGGGCAGG
TGTGGGCTGAAACCACTGTTGA

dg520 commented 4 years ago

Hi @glarue ,

Does your output contain Scaffold0 now? I forgot your index starts at 0. Glad you've fixed it. And you're right that the missing scaffolds are highly likely due to their repetitiveness in sequences. To confirm this, you can extract the alignment of these scaffolds. I believe either their alignment CIGAR strings are not 70M or they are not aligned to their original position in the genome (indicated by by the read name ).

Now seems we can pass the step1 of mapability calculation, although I'm still not sure if your original problem sits in too many opening files or insufficient stdout buffer.

I hope you keep all these bed.gz files. If so let's test if we can complete the second stage manually:

#make new folder to store temporary files
#avoiding reaching file number limitation 
mkdir tmp

#calculate regions to be excluded on each Scaffold according to the alignment
#also set xargs --max-procs to 1 for safety
#might be unnecessary as each call only opens a single output file here
#you might also have to replace '../STAR/chrNameLength.txt' with the true path
#which you can find under your STAR reference folder
time ls ./Scaffold*.bed.gz \
| \
xargs --max-args 1 --max-procs 1 -I{} bash -c \
"\
gzip -cd < {} \
| \
bedtools genomecov -i stdin -bga -g ../STAR/chrNameLength.txt \
| \
awk 'NR==1{chr=\$1;print}\$1==chr{print}' \
| \
awk 'BEGIN {FS=\"\t\"; OFS=\"\t\"} (\$4 < 5) {print \$1, \$2, \$3}' \
| \
bedtools merge -i stdin > tmp/{}.exclusion\
"

#merge all the exclusion regions into one final file with sorting
time cat tmp/*.exclusion | sort -S5G -k1,1 -k2,2n -k3,3n | gzip > MapabilityExclusion.bed.gz

If the above is successfully completed without error, we've done with mapability stage. The only file we need is MapabilityExclusion.bed.gz. And we can remove all the other temporary files:

rm genome_fragments.*
rm Scaffold*.bed.gz
rm -rf tmp

Now please check your IRFinder reference folder, which should contain 4 folders and two files

#the following four are folders
STAR
logSTARbuild
Mapability
IRFinder
#the following two are files
genome.fa
transcripts.gtf

They should be named exactly as indicated above. I believe you should already have them all. But if you miss Mapability, please make an empty one and you should put MapabilityExclusion.bed.gz under Mapability folder.

Finally we are going to manually make the IR reference. Please run the following in the terminal:

#go to your IRFinder reference folder
cd YOUR_IRFINDER_REFERENCE_FOLDER

#clean the pre-existing files under IRFinder folder
rm IRFinder/*

#replace the following four paths according to your settings and needs
IRFINDER_REF_FOLDER=YOUR_IRFINDER_REFERENCE_FOLDER;
IRFINDER_PROGRAM_FOLDER=/home/roylab/.installed_programs/IRFinder-1.2.6_hotfix_R_bug/bin/util;
ADDITIONAL_ROI="";   #optional, equals -R option in IRFinder main program
BLACKLIST="";              #optional, equals -b option in IRFinder main program

#make the reference
$IRFINDER_PROGRAM_FOLDER/Build-BED-refs.sh \
$IRFINDER_REF_FOLDER/transcripts.gtf \
$IRFINDER_REF_FOLDER/STAR/chrNameLength.txt \
$IRFINDER_REF_FOLDER/Mapability/MapabilityExclusion.bed.gz \
$IRFINDER_PROGRAM_FOLDER \
$ADDITIONAL_ROI \
$BLACKLIST

If you don't see any error, you should have IR reference file built under YOUR_IRFINDER_REFERENCE_FOLDER/IRFinder. Then you can re-run the quantification stage on your RNASeq samples.

Let me know if you still have problems in building reference. And I hope you don't encounter seg fault again during quantification.

glarue commented 4 years ago

@dg520

Thank you as always for your detailed instructions. The reference seems to have been built successfully at last (no errors); unfortunately, the quant. step throws a broken pipe followed by a number of segfaults:

$ IRFinder -r ref -S /home/roylab/.installed_programs/STAR-2.7.3a/bin/Linux_x86_64/STAR -t 1 -d ERR557103_irf ERR557103.fastq
WARNING: Novosort license has expired (in file /home/roylab/.installed_programs/novocraft/novoalign.lic), not sorting output.
/home/roylab/.installed_programs/IRFinder-1.2.6_hotfix_R_bug/bin/IRFinder: line 562:  3469 Broken pipe             "$STAREXEC" --genomeLoad $STARMEMORYMODE --runThreadN $THREADS --genomeDir "$REF/STAR" --outFilterMultimapNmax 1 --outSAMstrandField intronMotif --outFileNamePrefix "${OUTPUTDIR}/" --outSAMunmapped None --outSAMmode NoQS --outSAMtype BAM Unsorted --outStd BAM_Unsorted --readFilesIn "$1" $EXTRAREADFILESCOMMAND $EXTRAADAPT
      3470                       | tee "$OUTPUTDIR/Unsorted.bam"
      3471                       | gzip -cd
      3472 Segmentation fault      | "$LIBEXEC/irfinder" "$OUTPUTDIR" "$REF/IRFinder/ref-cover.bed" "$REF/IRFinder/ref-sj.ref" "$REF/IRFinder/ref-read-continues.ref" "$REF/IRFinder/ref-ROI.bed" "$OUTPUTDIR/unsorted.frag.bam" >> "$OUTPUTDIR/irfinder.stdout" 2>> "$OUTPUTDIR/irfinder.stderr"
ERROR: IRFinder appears not to have completed. It appears an unknown component crashed.
ERROR: IRFinder appears not to have completed. It appears an unknown component crashed.
ERROR: IRFinder appears not to have completed. It appears an unknown component crashed.

As before, nothing in the irfinder.stderr file.

dg520 commented 4 years ago

Hi @glarue ,

May I check the following:

When building the IR reference, the transcripts.gtf file in your ref folder must be a GTF file with attributes compatible to IRFinder. Did you use your new GTF with attribution corrected? The pipeline could finish without error using incompatible GTF but the quant step might fail later.

If GTF is OK, could you please also check files under your ref/IRFinder? Make sure they are all non-empty.

ref/IRFinder/ref-cover.bed 
ref/IRFinder/ref-sj.ref 
ref/IRFinder/ref-read-continues.ref 
ref/IRFinder/ref-ROI.bed"

If you the above pass, could you please try the out-of-box IRFinder, without re-compiling the irfinder core?

If the problem persists, could you please share me the genome, GTF and the first 10,000 lines of FASTQ somewhere, so that I can fix it for you on my end?

glarue commented 4 years ago

Hey @dg520,

I've checked the above as best I can and tried running everything through with both the default and re-compiled irfinder binaries without success.

As far as I can tell from examining gtf2bed-custom.pl, I've modified the annotation file to match the requirements (but I am not fluent in Perl, so I'm concerned I might be missing something), e.g.

Scaffold0       transdecoder    gene    45      13421   .       -       .       ID "XLOC_000085"; gene_biotype "protein_coding"; gene_id "XLOC_000085"; gene_name "XLOC_000085";
Scaffold0       transdecoder    transcript      45      13421   .       -       .       ID "TCONS_00000253.p1"; Parent "XLOC_000085"; gene_biotype "protein_coding"; gene_id "XLOC_000085"; gene_name "XLOC_000085"; original_biotype "mrna"; transcript_biotype "processed_transcript"; transcript_id "TCONS_00000253.p1";
Scaffold0       transdecoder    exon    45      87      .       -       .       ID "TCONS_00000253.p1.exon8"; Parent "TCONS_00000253.p1"; gene_biotype "protein_coding"; gene_id "XLOC_000085"; gene_name "XLOC_000085"; transcript_biotype "processed_transcript"; transcript_id "TCONS_00000253.p1";
Scaffold0       transdecoder    exon    9720    9778    .       -       .       ID "TCONS_00000253.p1.exon7"; Parent "TCONS_00000253.p1"; gene_biotype "protein_coding"; gene_id "XLOC_000085"; gene_name "XLOC_000085"; transcript_biotype "processed_transcript"; transcript_id "TCONS_00000253.p1";

I appreciate your offer to help—what is the best way to send you the relevant files?

dg520 commented 4 years ago

Hi @glarue ,

Thanks for your great efforts! Is it possible to share links of files with me through Google Doc or Dropbox? Let me know.

Best, Dadi

glarue commented 4 years ago

Sure- can you email me at [EDIT: removed] so I can send you links? I’d prefer not to post public links as the annotation is being used for a manuscript.

dg520 commented 4 years ago

Hi @glarue ,

Problem found: The original behavior of IRFinder is to collect coverage information of each reads, parallelly to STAR aligning process. In this way, the alignment of the first read in the FASTQ file also contains the BAM header information. In your genome, this header (i.e. all scaffolds) is a very very long string. At the same time, IRFinder core includes a module to write a shrunk (by replacing sequence and QS column with "*") BAM. This module is optional but we hard-coded to implement it every time. Unfortunately, this module, written in C++, allocates a memory size for each BAM line, which is far smaller than your header length. Thus IRFinder crashes after aligning the first reads when trying to write this shrunk BAM file. But fortunately, we can skip implementing this module. That'll solve your issue, at least on the 10K reads demo FASTQ.

Solution: Please open the wrapper/execution file bin/IRFinder and replace

"$OUTPUTDIR/unsorted.frag.bam"

everywhere with

"NULL"

Then you can re-run IRFinder with the entire RNASeq file and let me know if the problem persists.

Consequence: In this way, you will lose the unsorted.frag.bam in your output folder. I don't think that matters actually. But maybe it's a redundant module we should have not created it in the first place, as we already have Unsorted.bam in the output folder.

Best, Dadi

glarue commented 4 years ago

@dg520 I really appreciate the time you spent investigating this—your above suggestion seems to have allowed it to complete successfully on the full FASTQ file.

I am somewhat relieved that it wasn't a totally trivial issue, as I was worried I was missing something obvious with the annotation formatting. Thanks again for your time.