wenzelm / slidr-sloppr

GNU General Public License v3.0
2 stars 0 forks source link

Returning a NULL data.table #1

Closed fayweili closed 3 years ago

fayweili commented 3 years ago

Hi,

I'm trying to run the example data set, but got this error message:

./slidr.sh -g toy_data/GCF_000002985.6_WBcel235_genomic.fna -a toy_data/GCF_000002985.6_WBcel235_genomic.gff -o slidr_toy_data -m libraries_config.txt -c 4 > log

Warning message: In fread(cmd = paste0("zcat '", filterin, "'"), sep = "\t", header = F, : File '/tmp/RtmpJ0A33B/fileb9075da4e74f' has size 0. Returning a NULL data.table. Error in setkeyv(m, c("Centroid", "BLASTN_Match", "Outron_Overlap")) : some columns are not in the data.table: Centroid,BLASTN_Match Execution halted

The 3-results-x1.0-l8-e1-R80-DGT-S.{20,60}AT{4,6}G-L35-AAG-O10 is empty.

Thanks!!

Fay-Wei

wenzelm commented 3 years ago

Hi Fay-Wei,

Thank you for using SLIDR!

This error happens because the file 2-RNA_filters/SL.filters-x1.0-l8-e1-R80-DGT-S.{20,60}AT{4,6}G-L25-AAG-O10.txt.gz is empty. Some step in the pipeline didn't work, probably because a dependency was not available.

Could you check the various log files in the output directory and see which step failed? I'll change the code to try and make such errors more obvious.

Best wishes, Marius

fayweili commented 3 years ago

Thanks for the quick response! Here is the log file:

#
# SLIDR - Spliced Leader IDentification from RNA-Seq data
# Version 1.1
#
# General:
#   > Output directory   slidr_toy_data
#   > SL name prefix     SL
#   > Threads        4
#
# Reference:
#   > Genome         toy_data/GCF_000002985.6_WBcel235_genomic.fna
#   > Annotations    toy_data/GCF_000002985.6_WBcel235_genomic.gff
#   > Transcriptome  
#
# Read tail filtering:
#   > Minimum length     8
#   > Maximum length     1.0
#   > BLASTN E-value     1
#
# SL RNA filtering:
#   > Splice donor   GT
#   > Sm binding site    .{20,60}AT{4,6}G
#   > Splice acceptor    AG
#   > SL RNA length  80
#   > Acceptor overlap   10
#   > Stem-loop span     35
#
# RNA-Seq data:
#   > Libraries      libraries_config.txt
#   * ERR2756729 (-r x) -1 toy_data/ERR2756729_1.fastq.gz -2 toy_data/ERR2756729_2.fastq.gz trim 
#   * ERR2756730 (-r x) -1 toy_data/ERR2756730_1.fastq.gz -2 toy_data/ERR2756730_2.fastq.gz trim 
#
[2021-01-23 11:05:05] >>> Loading dependencies
[2021-01-23 11:05:05] >>> STAGE 1: Read tail extraction
[2021-01-23 11:05:05] Converting GFF to GTF ...
[2021-01-23 11:05:21] Generating HISAT2 index ...
[2021-01-23 11:09:40] >>> Processing library ERR2756729
[2021-01-23 11:09:40] Trimming adapters and poor-quality bases from reads ...
[2021-01-23 11:10:15] Aligning reads to genome ...
[2021-01-23 11:11:20] 96.45% overall alignment rate
[2021-01-23 11:11:21] Inferred library strandedness: 2
[2021-01-23 11:11:21] Extracting soft-clipped tails from alignments ...
[2021-01-23 11:11:37] Extracted 36766 tails
[2021-01-23 11:11:37] >>> Processing library ERR2756730
[2021-01-23 11:11:37] Trimming adapters and poor-quality bases from reads ...
[2021-01-23 11:12:22] Aligning reads to genome ...
[2021-01-23 11:13:31] 96.86% overall alignment rate
[2021-01-23 11:13:33] Inferred library strandedness: 2
[2021-01-23 11:13:33] Extracting soft-clipped tails from alignments ...
[2021-01-23 11:13:51] Extracted 40551 tails
[2021-01-23 11:13:51] Collecting tails ...
[2021-01-23 11:13:51] Dereplicating tails ...
[2021-01-23 11:13:51] Dereplicated 22143 tails (>= 8 bp) to 9692 unique tails
[2021-01-23 11:13:51] Clustering tails ...
[2021-01-23 11:13:52] Identified 8574 tail clusters
[2021-01-23 11:13:52] >>> STAGE 2: SL RNA identification
[2021-01-23 11:13:52] Aligning tail cluster centroids to reference ...
[2021-01-23 11:14:48] Generated 5373 alignment records
[2021-01-23 11:14:48] Identifying splice donor and Sm binding sites (GT.{20,60}AT{4,6}G) ...
[2021-01-23 11:14:49] Identified 361 donor splice sites and Sm binding sites (6% of alignments)
[2021-01-23 11:14:49] Extracting splice acceptor sites (AG) ...
[2021-01-23 11:14:51] Extracted 15854 splice acceptor sites
[2021-01-23 11:14:51] Predicting RNA secondary structures ...
[2021-01-23 11:14:52] Processed 261 unique RNAs
[2021-01-23 11:14:52] >>> STAGE 3: Consensus SL construction
[2021-01-23 11:14:52] Collecting SL RNA filter results ...
[2021-01-23 11:14:52] Re-clustering SL candidates ...
[2021-01-23 11:14:52] Generated  clusters
[2021-01-23 11:14:52] Constructing final consensus SLs ...
[2021-01-23 11:14:53] Finished!
fayweili commented 3 years ago

and here shows the files it generated

image
wenzelm commented 3 years ago

Thank you, that's very useful. It's the final clustering step that failed for some reason. I'll check the code and will get back to you very shortly.

wenzelm commented 3 years ago

I've pushed an update to the code, but not sure if that fixed your problem. Could you please run git pull to get the latest version, remove the SL.filters-* files in the 2-RNA_filters folder and re-run the same SLIDR command as before?

If this doesn't work, I suspect your system does not like something about my join or sort commands. Let me know and I can try and work this out.

fayweili commented 3 years ago

the 3-* folder wasn't generated:

image
fayweili commented 3 years ago

I'm running this on Ubuntu 14.04.5 LTS (GNU/Linux 4.4.0-103-generic x86_64)

wenzelm commented 3 years ago

Yes, this hasn't fixed the problem (I switched on error catching, so the pipeline stopped once it encountered an error). Did you get some system error message in the log file?

Your system does not like lines 821-830 of the code in slidr.sh:

join -e '-' -t "$(printf '\t')" -1 1 -2 2 \
    <(zless $tailprefix.derep.clusterinfo.txt.gz | sort -S50% -k1,1 | uniq) \
    <(zless $clusterprefix.clusterinfo.txt.gz | sort -S50% -k 2,2 | uniq) \
    | sort -S50% -k3,3 | uniq \
| join -e '-' -t "$(printf '\t')" -1 3 -2 1 - <(zcat "$smout" | sort -S50% -k1,1 | uniq) \
    | sort -S50% -k3,3 | uniq \
| join -e '-' -t "$(printf '\t')" -1 3 -2 1 - <(zcat "$spliceout" | sort -S50% -k1,1 | uniq) \
    | sort -S50% -k 8,8 | uniq \
| join -e '-' -t "$(printf '\t')" -1 8 -2 1 - <(zcat "$rnafout" | sort -S50% -k1,1 | uniq) \
| gzip > $cand

I'm not sure why. Perhaps your sort utility does not like the -S option? You could try and edit the code to test it? I'll try and get my hands on an Ubuntu machine and will try and get to the bottom of this. Sorry about that!

wenzelm commented 3 years ago

Out of interest, what is the content of 2-RNA_filters/SL.filters-x1.0-l8-e1-R80-DGT-S.{20,60}AT{4,6}G-L25-AAG-O10.txt.gz? It's 20B instead of empty.

fayweili commented 3 years ago

I cannot zcat or less SL.filters-x1.0-l8-e1-R80-DGT-S.{20,60}AT{4,6}G-L35-AAG-O10.txt.gz. I believe the , and/or {} characters in the file name were the problem. Once I replace them I could see the file content, which is basically empty. Maybe this might also be the reason the pipeline failed?

wenzelm commented 3 years ago

I'm so sorry to keep you waiting. I think I've tracked down the error. What version of VSEARCH are you using? I've used a more recent version of VSEARCH now and I'm encountering the same issue as you (I think). I'll get back to you with a solution shortly, thank you for your patience.

wenzelm commented 3 years ago

OK, I've pushed an update to the code. It was a problem with VSEARCH in later versions. Please pull the latest code, remove all previous output and try the example data runs again. Please let me know if this fixed your issue.

fayweili commented 3 years ago

awesome. I was able to run through the example dataset without a problem. Now onto my own data... Thanks!!!