htseq-count test data error

arsenij-ust commented 1 year ago

Hi @zhxiaokang

i run into a problem when i execute the pipeline with the paired-end test data using htseq-count. There appears the following warning and finally the Error caused by missing file:

        count   jobs
        1       featureCount
        1
17933 GFF lines processed.

Warning: Read SRR1039509.3494278 claims to have an aligned mate which could not be found in an adjacent line.
Warning: 81048 reads with missing mate encountered.

89918 SAM alignment pairs processed.
MissingOutputException in line 99 of /data/tools/RASflow/workflow/align_count_genome.rules:
Missing files after 5 seconds:
output/test/genome/countFile/SRR1039509_count.tsv.summary
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /data/tools/RASflow/.snakemake/log/2022-11-09T132244.600008.snakemake.log
DEA is not required and RASflow is done!

The problem is, that the file SRR1039509_count.tsv.summary is missing but I don't know if the warning of htseq-count is related to this problem.

When I execute the isolated command

htseq-count -f bam -i gene_id -s no -t exon output/test/test/genome/bamFileSort/SRR1039509.sort.bam data/example/ref/annotation/Homo_sapiens.GRCh38.93.1.1.10M.gtf | sed '/^__/ d' > output/test/genome/countFile/SRR1039509_count.tsv only the SRR1039509_count.tsv file is created.

With kind regards Arsenij

zhxiaokang commented 1 year ago

Hi Arsenij,

My suspicion is that: at some point, I added htseq-count but didn't test the whole workflow properly. featurecounts would output a summary file but not htseq-count. If you do not insist on summary report, the last rule for report can be removed. Therefore you can replace the align_count_genome.rules with the following content:

import pandas as pd
configfile: "configs/config_main.yaml"

samples = pd.read_csv(config["METAFILE"], sep = '\t', header = 0)['sample']
trimmed = config["TRIMMED"]
if trimmed:
    input_path = config["OUTPUTPATH"] + "/" + config["PROJECT"] + "/trim"
else:
    input_path = config["READSPATH"]
key = config["KEY"]
end = config["END"]
intermediate_path = config["OUTPUTPATH"] + "/" + config["PROJECT"] + "/genome"
final_path = config["FINALOUTPUT"] + "/" + config["PROJECT"] + "/genome"
alignmentQC = config["alignmentQC"]

rule end:
    input:
        count = expand(final_path + "/countFile/{sample}_count.tsv", sample = samples)

if end == "pair":
    rule getReads:
        output:
            forward = temp(intermediate_path + "/reads/{sample}_forward.fastq.gz"),
            reverse = temp(intermediate_path + "/reads/{sample}_reverse.fastq.gz")
        params:
            key = key,
            input_path = input_path
        run:
            shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward}"),
            shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse}")

else:
    rule getReads:
        output:
            read = temp(intermediate_path + "/reads/{sample}.fastq.gz")
        params:
            key = key,
            input_path = input_path
        shell:
            """
            shopt -s extglob
            scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
            """

rule indexGenome:
    input:
        genome = config["GENOME"]
    output:
        indexes = directory(intermediate_path + "/indexes"),
        splicesites = intermediate_path + "/splicesites.txt"
    params:
        index = intermediate_path + "/indexes/index"
    shell:
        "mkdir {output.indexes} && hisat2-build -p {config[NCORE]} {input.genome} {params.index}"
        "&& hisat2_extract_splice_sites.py {config[ANNOTATION]} > {output.splicesites}"

if end == "pair":
    rule alignment:
        input:
            indexes = directory(intermediate_path + "/indexes"),
            splicesites = intermediate_path + "/splicesites.txt",
            forward = intermediate_path + "/reads/{sample}_forward.fastq.gz",
            reverse = intermediate_path + "/reads/{sample}_reverse.fastq.gz"
        output:
            sam = temp(intermediate_path + "/samFile/{sample}.sam"),
            bam = temp(intermediate_path + "/bamFile/{sample}.bam")
        params:
            index = intermediate_path + "/indexes/index"
        benchmark:
            intermediate_path + "/benchmarks/{sample}.hisat2.benchmark.txt"
        run:
            shell("hisat2 -p {config[NCORE]} --known-splicesite-infile {input.splicesites} -x {params.index} -1 {input.forward} -2 {input.reverse} -S {output.sam}")
            shell("samtools view -@ {config[NCORE]} -b -S {output.sam} > {output.bam}")
else:
    rule alignment:
        input:
            indexes = directory(intermediate_path + "/indexes"),
            splicesites = intermediate_path + "/splicesites.txt",
            forward = intermediate_path + "/reads/{sample}.fastq.gz"
        output:
            sam = temp(intermediate_path + "/samFile/{sample}.sam"),
            bam = temp(intermediate_path + "/bamFile/{sample}.bam")
        params:
            index = intermediate_path + "/indexes/index"
        benchmark:
            intermediate_path + "/benchmarks/{sample}.hisat2.benchmark.txt"
        run:
            shell("hisat2 -p {config[NCORE]} --known-splicesite-infile {input.splicesites} -x {params.index} -U {input.forward} -S {output.sam}")
            shell("samtools view -@ {config[NCORE]} -b -S {output.sam} > {output.bam}")

rule sortBAM:
    input:
        bam = intermediate_path + "/bamFile/{sample}.bam"
    output:
        sort = intermediate_path + "/bamFileSort/{sample}.sort.bam"
    shell:
        "samtools sort -@ {config[NCORE]} {input.bam} -o {output.sort}"

rule featureCount:
    input:
        sort = intermediate_path + "/bamFileSort/{sample}.sort.bam",
        annotation = config["ANNOTATION"]
    output:
        count = final_path + "/countFile/{sample}_count.tsv"
    run:
        if config["COUNTER"]=="featureCounts":
            if config["END"]=="pair":
                shell("featureCounts -p -T {config[NCORE]} -t exon -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
            else:
                shell("featureCounts -T {config[NCORE]} -t exon -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
        elif config["COUNTER"]=="htseq-count":
            shell("htseq-count -f bam -i {config[ATTRIBUTE]} -s no -t exon {input.sort} {input.annotation} | sed '/^__/ d' > {output.count}")

Let me know if that does not work for you.

arsenij-ust commented 1 year ago

Hi @zhxiaokang, thanks for the fast reply and the provided solution. This is exactly how I fixed the problem. But it's of course only a temporary solution.

With kind regards Arsenij

zhxiaokang commented 1 year ago

Hi Arsenij, sure, that's only temporary. Will fix it when I have time for this. Will keep this issue open and close it when it's fixed, in a good way.

zhxiaokang / RASflow

htseq-count test data error #32