problem with gtf file - Githubissues

pavlo888 commented 3 years ago

Hi @zhxiaokang

It seems I am having a problem with the BAM step in my workflow.

I get the following output

`[Wed Jan 6 23:51:27 2021] rule featureCount: input: data/output/pva027/genome/bamFileSort/20-0357.sort.bam, data/example/ref/annotation/027-annot.gff output: output-pva/pva027/genome/countFile/20-0357_count.tsv, output-pva/pva027/genome/countFile/20-0357_count.tsv.summary jobid: 2 wildcards: sample=20-0357

Job counts: count jobs 1 featureCount 1

    ==========     _____ _    _ ____  _____  ______          _____  
    =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
      =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
        ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
          ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
    ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
  v1.6.4

//========================== featureCounts setting ===========================\						Input files : 1 BAM file				P 20-0357.sort.bam
Output file : 20-0357_count.tsv
Summary : 20-0357_count.tsv.summary
Annotation : 027-annot.gff (GTF)
Dir for temp files : output-pva/pva027/genome/countFile

Threads : 8
Level : meta-feature level
Paired-end : yes
Multimapping reads : not counted
Multi-overlapping reads : not counted
Min overlapping bases : 1

Chimeric reads : counted
Both ends mapped : not required

\============================================================================//

//================================= Running ==================================\ || || || Load annotation file 027-annot.gff ... || ERROR: no features were loaded in format GTF. The annotation format can be specified by the '-F' option, and the required feature type can be specified by the '-t' option.. The porgram has to terminate.

[Wed Jan 6 23:51:28 2021] Error in rule featureCount: jobid: 0 output: output-pva/pva027/genome/countFile/20-0357_count.tsv, output-pva/pva027/genome/countFile/20-0357_count.tsv.summary

RuleException: CalledProcessError in line 109 of /home/sam/Downloads/RASflow/workflow/align_count_genome.rules: Command ' set -euo pipefail; featureCounts -p -T 8 -t exon -g ID -a data/example/ref/annotation/027-annot.gff -o output-pva/pva027/genome/countFile/20-0357_count.tsv data/output/pva027/genome/bamFileSort/20-0357.sort.bam && tail -n +3 output-pva/pva027/genome/countFile/20-0357_count.tsv | cut -f1,7 > temp.20-0357 && mv temp.20-0357 output-pva/pva027/genome/countFile/20-0357_count.tsv ' returned non-zero exit status 255. File "/home/sam/Downloads/RASflow/workflow/align_count_genome.rules", line 109, in __rule_featureCount File "/home/sam/anaconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run Exiting because a job execution failed. Look above for error message ` Do you have any idea how I could fix it?

Cheers, Pablo

zhxiaokang commented 3 years ago

This is an error from featureCounts which seems not able to understand the GFF file you provided. Please check your GFF file and also the parameter "ATTRIBUTE" in config_main.yaml to make sure that they agree with each other.

pavlo888 commented 3 years ago

Hi @zhxiaokang,

I am providing a gff file directly from a Prokka output. Also, the ATTRIBUTE is set as "ID" instead of "gene_id". Nevertheless, I still get an error.

Maybe I should edit the gff file or convert it to some other file type?

Cheers, Pablo

zhxiaokang commented 3 years ago

Not sure whether this is the issue, but the error says: no features were loaded in format GTF, but you're actually using a GFF. Maybe try to use a GTF file instead of GFF?

And do you mind sharing the GFF file you're using? At least show some lines that include the "ATTRIBUTE".

pavlo888 commented 3 years ago

But Prokka doesn't provide gtf file. Also, gtf files are the old version of gff (gtf version 3, I think?)

I will share the file and the yaml line

pavlo888 commented 3 years ago

This is how the GFF files looks like (these are screenshots since I cannot upload a gff file here) Screenshot from 2021-01-07 19-58-19

And this is the line of the config yaml file

# genome and annotation files
GENOME: data/example/ref/genome/027-annot.fna
ANNOTATION: data/example/ref/annotation/027-annot.gff
ATTRIBUTE: ID  # the attribute used in annotation file. It's usually "gene_id", but double check that since it may also be "gene", "ID"...

zhxiaokang commented 3 years ago

Hi, I'm sorry for the bad news but it seems that both featureCounts and htseq-count are designed for GTF format: https://help.galaxyproject.org/t/problems-with-attributes-in-featurecounts-gff3-input-instead-of-gtf/3046/2

As I tried out the example data on a GFF3 file with both tools, and they reported similar errors as what you got that they couldn't find the correct attribute. You may try to convert the GFF3 file into GTF file first. Here are some tools for the conversion: https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/gff_to_gtf.md

Hope this helps.

pavlo888 commented 3 years ago

Hi @zhxiaokang

I am still having issues even when I have converted the gff file into gtf file using AGAT. My GTF file looks like this Screenshot from 2021-01-18 13-12-17

This is the error I obtain: `ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file. The specified gene identifier attribute is 'gene_id' An example of attributes included in your GTF annotation is 'gene_id "nbis-gene-95"; transcript_id "CDACBCJP_00095_gene"; ID "nbis-exon-95"; Parent "CDACBCJP_00095_gene"; inference "ab initio prediction:Prodigal:002006" "similar to AA sequence:ISfinder:ISHaha5"; locus_tag "CDACBCJP_00095"; product "IS110 family transposase ISHaha5"; protein_id "gnl|X|CDACBCJP_00095";' The program has to terminate.

`

`Error in rule featureCount: jobid: 0 output: output-pva/pva027/genome/countFile/20-0357_count.tsv, output-pva/pva027/genome/countFile/20-0357_count.tsv.summary

RuleException: CalledProcessError in line 109 of /home/sam/Downloads/ilse/RASflow/workflow/align_count_genome.rules: Command ' set -euo pipefail; featureCounts -p -T 4 -t exon -g gene_id -a data/example/ref/annotation/027-annot.gtf -o output-pva/pva027/genome/countFile/20-0357_count.tsv data/output/pva027/genome/bamFileSort/20-0357.sort.bam && tail -n +3 output-pva/pva027/genome/countFile/20-0357_count.tsv | cut -f1,7 > temp.20-0357 && mv temp.20-0357 output-pva/pva027/genome/countFile/20-0357_count.tsv ' returned non-zero exit status 255. File "/home/sam/Downloads/ilse/RASflow/workflow/align_count_genome.rules", line 109, in __rule_featureCount File "/home/sam/anaconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run Exiting because a job execution failed. Look above for error message `

zhxiaokang commented 3 years ago

The GTF file looks good to me, at least the part in the screenshot. Could you share the whole file? I suspect that there are some issues somewhere in the file (not in the screenshot part).

pavlo888 commented 3 years ago

do you have an e-mail for sending the file? I cannot upload here on github.

zhxiaokang commented 3 years ago

zhxiaokang@gmail.com

zhxiaokang commented 3 years ago

Hi, after testing with your GTF file, I found that featureCounts will throw this error when there are two fields with quotes in the "inference" part. In the ERROR message you posted above, there are "ab initio prediction:Prodigal:002006" and "similar to AA sequence:ISfinder:ISHaha5", and there are more such cases in your GTF file.

I'm trying to post this issue in their Google group but am still waiting to be admitted into the group.

For the time being, you may fix the issue by only cutting the gene_id part from the GTF file with such command: cut -d';' -f1 027-annot.gtf > 027-annot_only_gene_id.gtf And use 027-annot_only_gene_id.gtf instead. I have tested this strategy (only picking the gene_id part) on the example data in RASflow, it produced almost the same counts as using the original GTF file.

pavlo888 commented 3 years ago

hi @zhxiaokang

Thank you for providing the command for fixing the GTF file. It worked perfectly!

However, there seems to be another error yet on the DEA visualization step

`Error in glmFit.default(y = y$counts, design = design, dispersion = dispersion, : dispersion must be numeric Calls: DEA ... glmFit -> glmFit.DGEList -> glmFit -> glmFit.default In addition: Warning message: In estimateDisp.default(y = y$counts, design = design, group = group, : No residual df: setting dispersion to NA Execution halted [Wed Jan 20 09:27:00 2021] Error in rule DEA: jobid: 1 output: output-pva/pva027/genome/dea/countGroup/Untreated_gene_norm.tsv, output-pva/pva027/genome/dea/countGroup/Calcium_gene_norm.tsv, output-pva/pva027/genome/dea/DEA/dea_Untreated_Calcium.tsv, output-pva/pva027/genome/dea/DEA/deg_Untreated_Calcium.tsv

RuleException: CalledProcessError in line 38 of /home/sam/Downloads/ilse/RASflow/workflow/dea_genome.rules: Command ' set -euo pipefail; Rscript scripts/dea_genome.R ' returned non-zero exit status 1. File "/home/sam/Downloads/ilse/RASflow/workflow/dea_genome.rules", line 38, in __rule_DEA File "/home/sam/anaconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run Removing output files of failed job DEA since they might be corrupted: output-pva/pva027/genome/dea/countGroup/Untreated_gene_norm.tsv, output-pva/pva027/genome/dea/countGroup/Calcium_gene_norm.tsv Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /home/sam/Downloads/ilse/RASflow/.snakemake/log/2021-01-20T092655.885343.snakemake.log DEA is done! Start visualization of DEA results! Building DAG of jobs... Using shell: /bin/bash Provided cores: 1 Rules claiming more threads will be scaled down. Job counts: count jobs 1 end 1 plot 2

[Wed Jan 20 09:27:00 2021] rule plot: input: output-pva/pva027/genome/dea/countGroup, output-pva/pva027/genome/dea/DEA output: output-pva/pva027/genome/dea/visualization/volcano_plot_Untreated_Calcium.pdf, output-pva/pva027/genome/dea/visualization/heatmap_Untreated_Calcium.pdf jobid: 1

Loading required package: plotscale hash-3.0.1 provided by Decision Patterns

Loading required package: GenomicFeatures Loading required package: BiocGenerics Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
colnames, colSums, dirname, do.call, duplicated, eval, evalq,
Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
rowSums, sapply, setdiff, sort, table, tapply, union, unique,
unsplit, which, which.max, which.min

Loading required package: S4Vectors Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:hash’:

values, values<-

The following object is masked from ‘package:base’:

expand.grid

Loading required package: IRanges Loading required package: GenomeInfoDb Loading required package: GenomicRanges Loading required package: AnnotationDbi Loading required package: Biobase Welcome to Bioconductor

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: ‘AnnotationDbi’

The following objects are masked from ‘package:hash’:

keys, keys<-

Loading required package: ggplot2 Loading required package: ggrepel Error in file(file, "rt") : cannot open the connection Calls: plot.volcano.heatmap -> read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file 'output-pva/pva027/genome/dea/DEA/dea_Untreated_Calcium.tsv': No such file or directory Execution halted [Wed Jan 20 09:27:06 2021] Error in rule plot: jobid: 1 output: output-pva/pva027/genome/dea/visualization/volcano_plot_Untreated_Calcium.pdf, output-pva/pva027/genome/dea/visualization/heatmap_Untreated_Calcium.pdf

RuleException: CalledProcessError in line 53 of /home/sam/Downloads/ilse/RASflow/workflow/visualize.rules: Command ' set -euo pipefail; Rscript scripts/visualize.R output-pva/pva027/genome/dea/countGroup output-pva/pva027/genome/dea/DEA output-pva/pva027/genome/dea/visualization ' returned non-zero exit status 1. File "/home/sam/Downloads/ilse/RASflow/workflow/visualize.rules", line 53, in __rule_plot File "/home/sam/anaconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /home/sam/Downloads/ilse/RASflow/.snakemake/log/2021-01-20T092700.946097.snakemake.log Visualization is done! RASflow is done! ` could you direct me on what do to fix it?

Cheers, Pablo

zhxiaokang commented 3 years ago

Hi Pablo, glad to hear that it works. Regarding the new issue, it actually happens in DEA, or rather, edgeR, since when I searched the error message, it led me to this post: https://github.com/nanoporetech/pipeline-transcriptome-de/issues/6 As mentioned here, the problem is that there are not enough replicates. So in your case, how many replicates do you have in each group?

pavlo888 commented 3 years ago

Hi @zhxiaokang

There are two conditions and I have just one replicate for each condition, so only two sets of reads. Does this mean I cannot go any further with the visualization?

Cheers, Pablo

zhxiaokang commented 3 years ago

With only one replicate, you actually can't do differential expression analysis (DEA) since that's not enough to make statistics sense.

pavlo888 commented 3 years ago

I was fearing that was the case. Thanks a lot for your help!!!

zhxiaokang / RASflow

problem with gtf file #18