zavolanlab / zarp

The Zavolab Automated RNA-seq Pipeline
https://zavolanlab.github.io/zarp/
Apache License 2.0
35 stars 5 forks source link

bug: kallisto genome quantification fails #159

Closed uniqueg closed 9 months ago

uniqueg commented 9 months ago

Describe the bug

When executing ZARP on the mouse sarcopenia dataset to be described as a use case in the manuscript, the following errors are produced:

InputFunctionException in rule genome_quantification_kallisto in file /path/to/zarp/workflow/rules/single_end.snakefile.smk, line 351:
Error:
  IndexError: index 0 is out of bounds for axis 0 with size 0
Wildcards:
  sample=3M_TSCmKO_RAPA_rep6
Traceback:
  File "/path/to/zarp/workflow/rules/single_end.snakefile.smk", line 364, in <lambda>
  File "/path/to/zarp/workflow/rules/common.smk", line 6, in get_sample
  File "/path/to/envs/zarp/lib/python3.11/site-packages/pandas/core/series.py", line 1108, in __getitem__

This is the rule raising the error (from workflow/rules/single_end.snakefile.smk):

rule genome_quantification_kallisto:
    """
        Quantification at transcript and gene level using Kallisto
    """
    input:
        reads=os.path.join(
            config["output_dir"],
            "samples",
            "{sample}",
            "{sample}.fq1.se.remove_polya.fastq.gz",
        ),
        index=lambda wildcards: os.path.join(
            config["kallisto_indexes"],
            get_sample("organism", search_id="index", search_value=wildcards.sample),
            "kallisto.idx",
        ),
    output:
        pseudoalignment=os.path.join(
            config["output_dir"],
            "samples",
            "{sample}",
            "quant_kallisto",
            "{sample}.se.kallisto.pseudo.sam",
        ),
        abundances=os.path.join(
            config["output_dir"],
            "samples",
            "{sample}",
            "quant_kallisto",
            "abundance.h5",
        ),
    shadow:
        "minimal"
    params:
        cluster_log_path=config["cluster_log_dir"],
        output_dir=lambda wildcards, output: os.path.dirname(output.pseudoalignment),
        fraglen=lambda wildcards: get_sample(
            "mean", search_id="index", search_value=wildcards.sample
        ),
        fragsd=lambda wildcards: get_sample(
            "sd", search_id="index", search_value=wildcards.sample
        ),
        directionality=lambda wildcards: get_directionality(
            get_sample("libtype", search_id="index", search_value=wildcards.sample),
            "kallisto",
        ),
        additional_params=parse_rule_config(
            rule_config,
            current_rule=current_rule,
            immutable=(
                "--single",
                "-i",
                "-o",
                "-l",
                "-s",
                "--pseudobam",
                "--fr-stranded",
                "--rf-stranded",
            ),
        ),
    container:
        "docker://quay.io/biocontainers/kallisto:0.48.0--h15996b6_2"
    conda:
        os.path.join(workflow.basedir, "envs", "kallisto.yaml")
    threads: 8
    resources:
        mem_mb=lambda wildcards, attempt: 6000 * attempt,
    log:
        stderr=os.path.join(
            config["log_dir"], "samples", "{sample}", current_rule + ".se.stderr.log"
        ),
    shell:
        "(kallisto quant \
        --single \
        -i {input.index} \
        -o {params.output_dir} \
        -l {params.fraglen} \
        -s {params.fragsd} \
        -t {threads} \
        {params.directionality} \
        {params.additional_params} \
        --pseudobam \
        {input.reads} > {output.pseudoalignment};) \
        2> {log.stderr}"

More specifically, this line:

get_sample("organism", search_id="index", search_value=wildcards.sample),

get_sample() is defined in workflow/rules/common.smk:

def get_sample(column_id, search_id=None, search_value=None):
    """Get relevant per sample information from samples table"""
    if search_id:
        if search_id == "index":
            return str(samples_table[column_id][samples_table.index == search_value][0])
        else:
            return str(
                samples_table[column_id][samples_table[search_id] == search_value][0]
            )
    else:
        return str(samples_table[column_id][0])

Specifically, this line is where the out of bounds error arises:

return str(samples_table[column_id][samples_table.index == search_value][0])

To Reproduce

Go to ZARP/f1000 in our group folder on the cluster login node. Execute run_zarp.sh.

Expected behavior

No error is raised and kallisto quantifies gene expression.

Screenshots

N/A

Additional context

Incidentally (or not), the line in get_sample() that raises the error is also among those that leads to warnings being raised in #158.

uniqueg commented 9 months ago

Turns out there was a typo in the sample name (that was corrected during execution) which caused this issue (rep6 instead of rep5).