Open arsenij-ust opened 1 year ago
Hi Arsenij,
My suspicion is that: at some point, I added htseq-count
but didn't test the whole workflow properly. featurecounts
would output a summary file but not htseq-count
. If you do not insist on summary report, the last rule for report can be removed. Therefore you can replace the align_count_genome.rules
with the following content:
import pandas as pd
configfile: "configs/config_main.yaml"
samples = pd.read_csv(config["METAFILE"], sep = '\t', header = 0)['sample']
trimmed = config["TRIMMED"]
if trimmed:
input_path = config["OUTPUTPATH"] + "/" + config["PROJECT"] + "/trim"
else:
input_path = config["READSPATH"]
key = config["KEY"]
end = config["END"]
intermediate_path = config["OUTPUTPATH"] + "/" + config["PROJECT"] + "/genome"
final_path = config["FINALOUTPUT"] + "/" + config["PROJECT"] + "/genome"
alignmentQC = config["alignmentQC"]
rule end:
input:
count = expand(final_path + "/countFile/{sample}_count.tsv", sample = samples)
if end == "pair":
rule getReads:
output:
forward = temp(intermediate_path + "/reads/{sample}_forward.fastq.gz"),
reverse = temp(intermediate_path + "/reads/{sample}_reverse.fastq.gz")
params:
key = key,
input_path = input_path
run:
shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward}"),
shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse}")
else:
rule getReads:
output:
read = temp(intermediate_path + "/reads/{sample}.fastq.gz")
params:
key = key,
input_path = input_path
shell:
"""
shopt -s extglob
scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
"""
rule indexGenome:
input:
genome = config["GENOME"]
output:
indexes = directory(intermediate_path + "/indexes"),
splicesites = intermediate_path + "/splicesites.txt"
params:
index = intermediate_path + "/indexes/index"
shell:
"mkdir {output.indexes} && hisat2-build -p {config[NCORE]} {input.genome} {params.index}"
"&& hisat2_extract_splice_sites.py {config[ANNOTATION]} > {output.splicesites}"
if end == "pair":
rule alignment:
input:
indexes = directory(intermediate_path + "/indexes"),
splicesites = intermediate_path + "/splicesites.txt",
forward = intermediate_path + "/reads/{sample}_forward.fastq.gz",
reverse = intermediate_path + "/reads/{sample}_reverse.fastq.gz"
output:
sam = temp(intermediate_path + "/samFile/{sample}.sam"),
bam = temp(intermediate_path + "/bamFile/{sample}.bam")
params:
index = intermediate_path + "/indexes/index"
benchmark:
intermediate_path + "/benchmarks/{sample}.hisat2.benchmark.txt"
run:
shell("hisat2 -p {config[NCORE]} --known-splicesite-infile {input.splicesites} -x {params.index} -1 {input.forward} -2 {input.reverse} -S {output.sam}")
shell("samtools view -@ {config[NCORE]} -b -S {output.sam} > {output.bam}")
else:
rule alignment:
input:
indexes = directory(intermediate_path + "/indexes"),
splicesites = intermediate_path + "/splicesites.txt",
forward = intermediate_path + "/reads/{sample}.fastq.gz"
output:
sam = temp(intermediate_path + "/samFile/{sample}.sam"),
bam = temp(intermediate_path + "/bamFile/{sample}.bam")
params:
index = intermediate_path + "/indexes/index"
benchmark:
intermediate_path + "/benchmarks/{sample}.hisat2.benchmark.txt"
run:
shell("hisat2 -p {config[NCORE]} --known-splicesite-infile {input.splicesites} -x {params.index} -U {input.forward} -S {output.sam}")
shell("samtools view -@ {config[NCORE]} -b -S {output.sam} > {output.bam}")
rule sortBAM:
input:
bam = intermediate_path + "/bamFile/{sample}.bam"
output:
sort = intermediate_path + "/bamFileSort/{sample}.sort.bam"
shell:
"samtools sort -@ {config[NCORE]} {input.bam} -o {output.sort}"
rule featureCount:
input:
sort = intermediate_path + "/bamFileSort/{sample}.sort.bam",
annotation = config["ANNOTATION"]
output:
count = final_path + "/countFile/{sample}_count.tsv"
run:
if config["COUNTER"]=="featureCounts":
if config["END"]=="pair":
shell("featureCounts -p -T {config[NCORE]} -t exon -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
else:
shell("featureCounts -T {config[NCORE]} -t exon -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
elif config["COUNTER"]=="htseq-count":
shell("htseq-count -f bam -i {config[ATTRIBUTE]} -s no -t exon {input.sort} {input.annotation} | sed '/^__/ d' > {output.count}")
Let me know if that does not work for you.
Hi @zhxiaokang, thanks for the fast reply and the provided solution. This is exactly how I fixed the problem. But it's of course only a temporary solution.
With kind regards Arsenij
Hi Arsenij, sure, that's only temporary. Will fix it when I have time for this. Will keep this issue open and close it when it's fixed, in a good way.
Hi @zhxiaokang
i run into a problem when i execute the pipeline with the paired-end test data using htseq-count. There appears the following warning and finally the Error caused by missing file:
The problem is, that the file
SRR1039509_count.tsv.summary
is missing but I don't know if the warning of htseq-count is related to this problem.When I execute the isolated command
htseq-count -f bam -i gene_id -s no -t exon output/test/test/genome/bamFileSort/SRR1039509.sort.bam data/example/ref/annotation/Homo_sapiens.GRCh38.93.1.1.10M.gtf | sed '/^__/ d' > output/test/genome/countFile/SRR1039509_count.tsv
only the SRR1039509_count.tsv file is created.With kind regards Arsenij