zhilizheng / qsubshcom

qsub command for several cluster engines
MIT License

Running a large number of qsubshcom jobs sequentially when each job has a large number of arrays #8

Open kcstringer opened 3 months ago

kcstringer commented 3 months ago

I have a dataset of 7001 columns, with column 1 being the ID and each subsequent column representing one trait. I want to use R to calculate Kendall's rank correlation for each pair of trait columns and to extract the correlation estimates and p-values. The total number of calculations is 7000*6999/2 = 24496500.

I want to parallelize the task such that each array task does the calculation for one of the 24496500 column pairs. However, I cannot specify -array=1-24496500, since 24496500 exceeds QOSMaxSubmitJobPerUserLimit, which at my institution is 20000. So I wrote a for loop in bash that submits qsubshcom jobs in batches, each batch containing 20000 array tasks. Each batch should run only after the previous batch has finished.

Here is my parKD.R script:

#!/usr/bin/env Rscript

library(argparser)
library(data.table)
library(rstatix) # p_format
library(tidyverse)

args <- arg_parser("Parallel Kendall") %>%
    add_argument("--col1",
        help = "First input vector",
        type = "numeric"
    ) %>%
    add_argument("--col2",
        help = "Second input vector",
        type = "numeric"
    ) %>%
    add_argument("--outCor",
        help = "Output correlation estimates",
        type = "character"
    ) %>%
    add_argument("--outPval",
        help = "Output p value",
        type = "character"
    ) %>%
    parse_args()

# Column 1 is the ID; trait columns start at column 2
data <- fread("<PATH>/2.5_RiemannianDist.txt")

# args$col1/args$col2 index the traits, so +1 skips the ID column
test <- cor.test(data[[args$col1 + 1]], data[[args$col2 + 1]], method = "kendall")

pval <- p_format(test$p.value, accuracy = 1e-09)
cor <- test$estimate

fwrite(as.data.frame(cor), args$outCor)
fwrite(as.data.frame(pval), args$outPval)

Here is the parKD.sh script, which does the calculation for one column pair per array task:

#!/bin/bash

script_dir=$(dirname "$(readlink -f "$0")")
logs_dir=${script_dir}/../../logs/4_kendall/parKD
results_dir=${script_dir}/../../results/4_kendall/parKD

cd ${logs_dir}

# T7025 lists one comma-separated column pair per line; TASK_ID is set
# by qsubshcom, so line number TASK_ID gives this array task's pair
trait2=$(awk -F ',' -v task=$((TASK_ID)) 'NR==task {print $1}' ${results_dir}/T7025)
trait1=$(awk -F ',' -v task=$((TASK_ID)) 'NR==task {print $2}' ${results_dir}/T7025)

Rscript \
${script_dir}/parKD.R \
--col1 ${trait1} \
--col2 ${trait2} \
--outCor ${results_dir}/CorT${trait1}T${trait2} \
--outPval ${results_dir}/pvalT${trait1}T${trait2}
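
For reference, T7025 contains one comma-separated column pair per line, so that line N gives the pair for array task N. It can be produced with a short R sketch like the following (the combn-based construction is just one way to build it):

# Sketch (illustrative): enumerate all 7000*6999/2 = 24496500 column
# pairs and write them as a headerless two-column CSV, matching what
# the awk calls in parKD.sh expect
library(data.table)

pairs <- as.data.table(t(combn(7000, 2)))  # 24496500 rows, one pair each
fwrite(pairs, "T7025", col.names = FALSE)  # fwrite defaults to comma-separated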

Finally, here is the run_parKD.sh script that submits qsubshcom jobs in batches of 20000, the limit imposed by my institution's QOSMaxSubmitJobPerUserLimit:

#!/bin/bash

total_jobs=24496500 
jobs_per_batch=20000
batch_count=$((total_jobs / jobs_per_batch))
remainder=$((total_jobs % jobs_per_batch))

previous_job_id=""

for ((i=0; i<batch_count; i++)); do
    start=$((i * jobs_per_batch + 1))
    end=$((start + jobs_per_batch - 1))
    job_name="KD$((i+1))"

    if [ -z "$previous_job_id" ]; then
        # First job doesn't need to wait
        job_id=$(qsubshcom "bash parKD.sh" 1 500Mb "$job_name" 23:00:00 "-queue=intel --qos=huge -array=${start}-${end}")
    else
        # Subsequent jobs wait for the previous job
        job_id=$(qsubshcom "bash parKD.sh" 1 500Mb "$job_name" 23:00:00 "-queue=intel --qos=huge -array=${start}-${end} -wait=$previous_job_id")
    fi

    previous_job_id=$job_id
    echo "Submitted job $job_name with ID $job_id"
done

if [ $remainder -gt 0 ]; then
    start=$((batch_count * jobs_per_batch + 1))
    end=$total_jobs
    job_name="KD$((batch_count+1))"

    job_id=$(qsubshcom "bash parKD.sh" 1 500Mb "$job_name" 23:00:00 "-queue=intel --qos=huge -array=${start}-${end} -wait=$previous_job_id")
    echo "Submitted final job $job_name with ID $job_id"
fi

For run_parKD.sh, if I set total_jobs=7 and jobs_per_batch=2, the jobs run the way I want: the first batch of 2 array tasks runs; when it completes, the second batch runs; when that completes, the third batch runs, and so on.

However, when I set total_jobs=24496500 and jobs_per_batch=20000, here are the first several lines I get in the console:

Submitted job KD1 with ID 4087124
sbatch: error: QOSMaxSubmitJobPerUserLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
Submitted job KD2 with ID 
sbatch: error: Batch job submission failed: Invalid job array specification
Submitted job KD3 with ID 

The trouble is that only the first batch is submitted successfully. Since the batch size of 20000 already reaches QOSMaxSubmitJobPerUserLimit, the second and every subsequent batch is rejected. run_parKD.sh fails here because it submits all batches for queueing up front and relies on -wait to run them sequentially; but since batch 1 alone uses up QOSMaxSubmitJobPerUserLimit, none of the later batches can even be queued.

If I instead submit each batch manually, starting a new job array only after the previous one has finished, I would have to do this 1225 times... which is impractical.

My question is how to submit job arrays through qsubshcom such that all 24496500 calculations can be submitted in a single step, given that QOSMaxSubmitJobPerUserLimit is only 20000.

Thank you.

Kieran

zhilizheng commented 3 months ago

Hi @kcstringer,

It is not good practice to have a large number of jobs. Too many jobs make the scheduler busy and unresponsive, which is why admins usually prohibit users from running this many.

Even if the admins allowed it, your submission would result in very long wait times for jobs with very short runtimes.

You can reduce the number of jobs by looping inside your R script: take thousands of pairs, or even more, into one job; each job still won't take long to finish. A sketch of this idea is below.
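
For example, something like the following sketch (the chunk size, file names, and memory request are placeholders; it reuses the TASK_ID variable your parKD.sh already relies on) lets one array task handle 20000 pairs, so the whole analysis fits in a single submission:

#!/usr/bin/env Rscript
# Sketch: one array task processes a whole chunk of pairs instead of one.
# With chunk_size = 20000, the 24496500 pairs need only
# ceiling(24496500 / 20000) = 1225 array tasks, so a single
#   qsubshcom "Rscript parKD_chunk.R" 1 4G KD 23:00:00 "-array=1-1225"
# stays well under the QOS limit (parKD_chunk.R and 4G are placeholders).
library(data.table)

chunk <- as.integer(Sys.getenv("TASK_ID"))  # array task id, as in parKD.sh
chunk_size <- 20000L

pairs <- fread("T7025", header = FALSE)     # the pair list used by parKD.sh
data <- fread("<PATH>/2.5_RiemannianDist.txt")

start <- (chunk - 1L) * chunk_size + 1L
end <- min(chunk * chunk_size, nrow(pairs))

# One output row per pair; +1 skips the ID column, as in parKD.R
res <- rbindlist(lapply(start:end, function(k) {
    c1 <- pairs[[1]][k]
    c2 <- pairs[[2]][k]
    test <- cor.test(data[[c1 + 1]], data[[c2 + 1]], method = "kendall")
    data.table(col1 = c1, col2 = c2,
               cor = unname(test$estimate), pval = test$p.value)
}))

# One result file per chunk (1225 in total) instead of millions of tiny files
fwrite(res, sprintf("kendall_chunk%04d.csv", chunk))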

Regards, Zhili