Closed fayweili closed 3 years ago
Hi, I hope SLIDR was working well for you :)
Ah, good point about missing exon features. I have now implemented a new option that allows you to set the feature ID for read quantification; this complements the existing option for setting the meta-feature ID. If you pull the latest code for SLOPPR you can now do the following:
sloppr.sh -f CDS -F geneID
What I'm worried about is that your GFF snippet does not work in GFFREAD (which SLOPPR uses internally to convert to GTF), so it may be malformed. Either convert it to GTF yourself before running SLOPPR, or try renaming the file to ".gtf" to avoid the conversion. Please let me know how you get on.
Thanks, Marius
I'm getting this error message, regardless I use gff or gtf: ***** ERROR: Requested column 4, but database file stdin only has fields 1 - 0.
At what point of the pipeline does this error message appear? Could you paste the full log, please? Also, if you have managed to convert your gff to gtf, could you paste a few gtf lines, please?
My command:
~/bin/slidr-sloppr/sloppr.sh -f CDS -F geneID -s SL.fa -n 30 -p C73_ -c 8 -r x -g ../Monodopsis_C73_genome_v1.fa -a ../_annotation_v2/Monodopsis_C73_genome_v1_annotation_filtered.gff -1 ../_data_RNA_seq/Monodopsis_C073_S36_L002_R1_001_trimmed.fastq.gz -2 ../_data_RNA_seq/Monodopsis_C073_S36_L002_R2_001_trimmed.fastq.gz
First five lines of gff:
Monodopsis_C73_contig_1 AUGUSTUS mRNA 18365 20109 . - . ID=Monodopsis_C73_contig_1_g00010.t1;geneID=Monodopsis_C73_contig_1_g00010
Monodopsis_C73_contig_1 AUGUSTUS CDS 18365 18412 1.00 - 0 Parent=Monodopsis_C73_contig_1_g00010.t1
Monodopsis_C73_contig_1 AUGUSTUS CDS 18607 18805 1.00 - 1 Parent=Monodopsis_C73_contig_1_g00010.t1
Monodopsis_C73_contig_1 AUGUSTUS CDS 18908 19020 1.00 - 0 Parent=Monodopsis_C73_contig_1_g00010.t1
Monodopsis_C73_contig_1 AUGUSTUS CDS 19092 19593 1.00 - 1 Parent=Monodopsis_C73_contig_1_g00010.t1
Monodopsis_C73_contig_1 AUGUSTUS CDS 19839 19902 1.00 - 2 Parent=Monodopsis_C73_contig_1_g00010.t1
Monodopsis_C73_contig_1 AUGUSTUS CDS 20052 20109 0.57 - 0 Parent=Monodopsis_C73_contig_1_g00010.t1
Log:
#
# SLOPPR - Spliced-Leader-informed Operon Prediction from RNA-Seq data
# Version 1.1
#
# General:
# > Output directory SLOPPR_2021-02-01_10-57-17
# > Threads 8
# > Operon prefix C73_
# > Reference operons
#
# Reference:
# > Genome ../Monodopsis_C73_genome_v1.fa
# > Annotations ../_annotation_v2/Monodopsis_C73_genome_v1_annotation_filtered.gff
#
# SL quantification:
# > SL sequences SL.fa
# > Minimum SL tail 30
# > Error rate 0.09
# > Feature ID CDS
# > Meta-feature ID geneID
#
# Operon prediction:
# > SL-count aggregation geo
# > Zero SL counts remove
# > SL2-type SLs
# > Downstream SL2:SL1 ratio >= infinity
# > Upstream SL2-bias no
# > Intercistronic distance <= infinity
#
# RNA-Seq data
# > R1 reads ../_data_RNA_seq/Monodopsis_C073_S36_L002_R1_001_trimmed.fastq.gz
# > R2 reads ../_data_RNA_seq/Monodopsis_C073_S36_L002_R2_001_trimmed.fastq.gz
# > Quality trimming no
# > BAM alignments
# > Strandedness x
#
[2021-02-01 10:57:17] >>> Loading dependencies
[2021-02-01 10:57:17] >>> STAGE 1: SL read identification
[2021-02-01 10:57:17] Converting GFF to GTF ...
[2021-02-01 10:57:17] Extracting unique exons from GTF ...
*****
***** ERROR: Requested column 4, but database file stdin only has fields 1 - 0.
[2021-02-01 10:57:17] Generating HISAT2 index ...
If using gtf My command:
~/bin/slidr-sloppr/sloppr.sh -f CDS -F gene_id -s SL.fa -n 30 -p C73_ -c 8 -r x -g ../Monodopsis_C73_genome_v1.fa -a ../_annotation_v2/Monodopsis_C73_genome_v1_annotation_filtered.gtf -1 ../_data_RNA_seq/Monodopsis_C073_S36_L002_R1_001_trimmed.fastq.gz -2 ../_data_RNA_seq/Monodopsis_C073_S36_L002_R2_001_trimmed.fastq.gz
gtf:
Monodopsis_C73_contig_1 AUGUSTUS gene 18365 20109 . - . Monodopsis_C73_contig_1_g00010
Monodopsis_C73_contig_1 AUGUSTUS transcript 18365 20109 . - . transcript_id "Monodopsis_C73_contig_1_g00010.t1"; gene_id "Monodopsis_C73_contig_1_g00010"
Monodopsis_C73_contig_1 AUGUSTUS stop_codon 18365 18367 . - 0 transcript_id "Monodopsis_C73_contig_1_g00010.t1"; gene_id "Monodopsis_C73_contig_1_g00010";
Monodopsis_C73_contig_1 AUGUSTUS CDS 18365 18412 1 - 0 transcript_id "Monodopsis_C73_contig_1_g00010.t1"; gene_id "Monodopsis_C73_contig_1_g00010";
Monodopsis_C73_contig_1 AUGUSTUS CDS 18607 18805 1 - 1 transcript_id "Monodopsis_C73_contig_1_g00010.t1"; gene_id "Monodopsis_C73_contig_1_g00010";
Monodopsis_C73_contig_1 AUGUSTUS CDS 18908 19020 1 - 0 transcript_id "Monodopsis_C73_contig_1_g00010.t1"; gene_id "Monodopsis_C73_contig_1_g00010";
Monodopsis_C73_contig_1 AUGUSTUS CDS 19092 19593 1 - 1 transcript_id "Monodopsis_C73_contig_1_g00010.t1"; gene_id "Monodopsis_C73_contig_1_g00010";
Monodopsis_C73_contig_1 AUGUSTUS CDS 19839 19902 1 - 2 transcript_id "Monodopsis_C73_contig_1_g00010.t1"; gene_id "Monodopsis_C73_contig_1_g00010";
Monodopsis_C73_contig_1 AUGUSTUS CDS 20052 20109 0.57 - 0 transcript_id "Monodopsis_C73_contig_1_g00010.t1"; gene_id "Monodopsis_C73_contig_1_g00010";
The error message is the same as above
Thank you for this, very helpful. That was an easy bug; just a hardcoded "exon" I forgot to change. Please try the latest code and use -f CDS -F gene_id as before. I hope that fixed it.
The latest code also contains an update to SLIDR that leverages the bugfixed VSEARCH and now recovers more reads. You may want to try and rerun your SLIDR runs just to check whether you get more coverage.
Thank you for your patience; this has really helped improve both pipelines a lot!
To clarify, always use -F gene_id, regardless of whether you use a GFF or GTF as input. SLOPPR converts to GTF internally, so the -F option must reflect the GTF nomenclature.
Cool. It ran past that point now!
Hi!
Me again... I'm running sloppr on my dataset but realized my gff does not have 'exon' attribute. Here is what my gff looks like:
So my question is, can I simply replace
CDS
toexon
?Fay-Wei