xihaoli / STAARpipeline-Tutorial

The tutorial for performing single-/multi-trait association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using FAVORannotator, STAARpipeline and STAARpipelineSummary
GNU General Public License v3.0
24 stars 17 forks source link

as.numeric(commandArgs(TRUE)[1]) #7

Closed pelinunal closed 2 years ago

pelinunal commented 2 years ago

Dear xihaoli,

To perform rare-variant analysis from genotype data, I have been sent batch jobs to our uni cluster using your R scripts, and I came until Individual_analysis, however, I have an issue about:

arrayid <- as.numeric(commandArgs(TRUE)[1]) print(arrayid) [1] NA

It returns as "NAreal" and the following codes don't work. Do you have any suggestions to solve this issue?

Thanks!

Here is R info;

sessionInfo() R version 4.2.0 (2022-04-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

xihaoli commented 2 years ago

Hi @pelinunal,

Glad that you have solved this issue. The example scripts are designed for jobs that are submitted via SLURM. Please feel free to make any adjustments suitable for your cluster environment, and let me know if you have any other questions. Thanks!

Best, Xihao

pelinunal commented 2 years ago

Hi Xihao again, Thank you for your reply and I have another question about Individual_analysis.

The 'gruopid' returns as negative values and therefore start_loc/end_loc are negative.

I do not want to change anything that could affect the results. Could you help me please with this?

Thanks! Pelin

> print(arrayid)
 [1]  25  50  70  89 108 123 139 154 162 176 190 204 214 223 232 236 245 254 260 267 271 272
> print(chr)
[1] 22

>if (chr == 1){
  groupid <- arrayid
}else{
  groupid <- arrayid - cumsum(jobs_num$individual_analysis_num)[chr-1]
} 
>print(groupid)
 [1] -247 -222 -202 -183 -164 -149 -133 -118 -110  -96  -82  -68  -58  -49  -40  -36  -27  -18  -12   -5   -1    0

>start_loc <- (groupid-1)*10e6 + jobs_num$start_loc[chr]
>end_loc <- start_loc + 10e6 - 1
>end_loc <- min(end_loc,jobs_num$end_loc[chr])

>print(start_loc)
 [1] -2459409480 -2209409480 -2009409480 -1819409480 -1629409480 -1479409480 -1319409480 -1169409480 -1089409480  -949409480
[11]  -809409480  -669409480  -569409480  -479409480  -389409480  -349409480  -259409480  -169409480  -109409480   -39409480
[21]      590520    10590520

>print(end_loc)
[1] -2449409481
xihaoli commented 2 years ago

Hi @pelinunal,

Thanks for your question. Could you please print jobs_num in R, so that we can take a closer look and see why the groupid and start_loc/end_loc can be negative?

Best, Xihao

pelinunal commented 2 years ago

Thank you for reaching out quickly. Here it is;

> print(jobs_num)
   chr start_loc   end_loc individual_analysis_num sliding_window_num scang_num
1    1     72647 248937242                      25                 50       166
2    2     10366 242160055                      25                 49       162
3    3     11859 198179447                      20                 40       133
4    4     36843 189998586                      19                 38       127
5    5     13701 181361105                      19                 37       121
6    6  30000022 170741281                      15                 29        94
7    7     37325 159334706                      16                 32       107
8    8     61464 145075204                      15                 30        97
9    9  61665918 138220359                       8                 16        52
10  10     11736 133773617                      14                 27        90
11  11    123276 135076471                      14                 27        90
12  12     14550 133259472                      14                 27        89
13  13  18175067 114353606                      10                 20        65
14  14  18297664 106880915                       9                 18        60
15  15  19866420 101973541                       9                 17        55
16  16     10573  36192247                       4                  8        25
17  17    114930  83238872                       9                 17        56
18  18     44788  80262395                       9                 17        54
19  19     71266  58606381                       6                 12        40
20  20     60435  64331792                       7                 13        43
21  21  13165333  46696127                       4                  7        23
22  22  10590520  19999883                       1                  2         7
xihaoli commented 2 years ago

Hi @pelinunal,

Thanks for sharing. In this case, you are expected to submit 272 parallel (independent) jobs as sum(jobs_num$individual_analysis_num) = 272. The problem is that in each of your submitted job, arrayid should be a unique integer (between 1 and 272, for example, 25), rather than a numeric vector as you posted above.

Hope this helps.

Best, Xihao

pelinunal commented 1 year ago

Hi Xihao, I have a general problem with arrayid and arrayid_longmask.

We are submitting jobs via LSF cluster environment. arrayid <- as.numeric(R.utils::commandArgs(TRUE)[1]) turns as integer (empty).

For individual analysis, adding arrayid <- as.integer(sum(jobs_num$individual_analysis_num)) worked smoothly, as you suggested.

However, for the other steps, we have tried to define arrayid or arrayid_longmask via arrayid <- as.integer(sum(group.num.allchr)). The results turn as NULL.

Do you have any suggestion? How we can solve this problem?

Thanks! Pelin

xihaoli commented 1 year ago

Hi @pelinunal,

Thanks for your question. I may not be that familiar with the LSF cluster environment you are using. However, for the SLURM cluster environment, in the associated shell script for the job submission, we specify the following line of command:

#SBATCH --array=1-379 --mem=6000

In this case, the SLURM scheduler will submit 379 jobs in parallel. In each submitted job, there is a unique job id which is an integer between 1 and 379, such that in the R script, the following command

arrayid <- as.numeric(commandArgs(TRUE)[1])

will assign that unique integer to the variable arrayid. For your case, you can either try to figure out what is the analog of the command #SBATCH --array=1-379 --mem=6000 in the LSF cluster environment; or you can manually submit 379 jobs, and explicitly set arrayid <- 1, ..., arrayid <- 379 in each job.

Hope this helps.

Best, Xihao