xjtu-omics / HiCAT


question about thread and memory #3

Closed Henry-Ding closed 1 year ago

Henry-Ding commented 1 year ago

Hi, I ran into an out-of-memory problem when using this software. Is the memory requirement of the HiCAT_HOR.py step unusually high? I installed the software with conda, and the data in testdata ran without error. The following is the error message:

`ed distance thread: 52
Traceback (most recent call last):
  File "/miniconda3/envs/mamba/envs/hicat/bin/HiCAT_HOR.py", line 1591, in <module>
    main()
  File "/miniconda3/envs/mamba/envs/hicat/bin/HiCAT_HOR.py", line 1497, in main
    edit_distance_matrix, block_name_index = calculateED(block_sequence, base_sequence,thread)
  File "/miniconda3/envs/mamba/envs/hicat/bin/HiCAT_HOR.py", line 68, in calculateED
    res = Parallel(n_jobs=thread)(delayed(ed_distance_apply_apply)(data, i) for i in split_in)
  File "/miniconda3/envs/mamba/envs/hicat/lib/python3.10/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/miniconda3/envs/mamba/envs/hicat/lib/python3.10/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/miniconda3/envs/mamba/envs/hicat/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/miniconda3/envs/mamba/envs/hicat/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/miniconda3/envs/mamba/envs/hicat/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}`

Looking forward to your answer. Best wishes, ding

865699871 commented 1 year ago

I think you can first try thread=1 on the test sequence. The test sequence is CHM13 CEN21, which is only 331k long, so HiCAT can complete this task quickly. thread=52 might be too large. Here is a partial log and the run time (199 s). (sh testRunHiCAT.sh)

start build block sequence and read base sequence calculate ed distance ed distance thread: 1 1949 pre merge matrix 249 generation cover distribution for cluster HOR thread: 1 get result Time: 199

It doesn't require a lot of memory. Still, how much memory does your working machine have?

I also tried thread=40 with `python HiCAT.py -i ./testdata/cen21.fa -t ./testdata/AlphaSat.fa -th 40`, and it still works well. Here is a partial log and the run time (11 s).

start build block sequence and read base sequence calculate ed distance ed distance thread: 40 1949 pre merge matrix 249 generation cover distribution for cluster HOR thread: 1 get result Time: 11

Henry-Ding commented 1 year ago

Hi, thank you for your prompt reply. Yes, I used the test sequence and it worked fine. But when I downloaded the complete CP068257.1 from NCBI and ran `hicat -i download.fasta -t ./testdata/AlphaSat.fa -th 52`, I got that memory error. I have 1 TB of memory. Looking forward to your answer. Best wishes, ding
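A rough way to see why input size matters here: the traceback shows that `calculateED` returns an `edit_distance_matrix`. If that matrix is dense and pairwise over all blocks (an assumption, not something confirmed in this thread), its size grows with the square of the number of blocks. The sketch below only does that arithmetic; `dense_matrix_bytes` and `human_readable` are hypothetical helper names, and the 8 bytes per entry is also an assumption.

```python
def dense_matrix_bytes(n_blocks: int, bytes_per_entry: int = 8) -> int:
    """Bytes needed for a dense n_blocks x n_blocks matrix.

    Assumption: one fixed-size entry per pair of blocks. HiCAT's real
    data layout may differ (for example, it may deduplicate blocks).
    """
    if n_blocks < 0 or bytes_per_entry <= 0:
        raise ValueError("n_blocks must be >= 0 and bytes_per_entry > 0")
    return n_blocks * n_blocks * bytes_per_entry


def human_readable(n_bytes: float) -> str:
    """Format a byte count using binary units (KiB, MiB, ...)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024
    return f"{value:.1f} {units[-1]}"  # not reached; keeps type checkers happy


if __name__ == "__main__":
    # Hypothetical block counts, chosen only to show the quadratic growth.
    for n in (2_000, 20_000, 200_000):
        print(n, human_readable(dense_matrix_bytes(n)))
```

Under these assumptions, a 100-fold increase in the number of blocks means a 10,000-fold increase in matrix size. Separately, joblib worker processes may each hold their own copy of the input data, so a high thread count can add to the total.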

865699871 commented 1 year ago

Hi ding, unfortunately the current HiCAT cannot take a whole chromosome as input; it was designed for centromere regions only. I suggest you first reduce the input sequence to the region of interest and then run HiCAT. If you already have a template sequence, you can use lastz to find the matching regions. If you do not have a template sequence, you can use TRF to detect tandem repeats first and obtain template sequences from them.
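Reducing the input to one region can be done with any FASTA tool. As a minimal sketch, assuming the region coordinates are already known (for example from a lastz alignment or a published annotation), the following standard-library Python cuts a 1-based inclusive interval out of a record. The helper names `read_fasta`, `extract_region`, and `format_fasta` are hypothetical and are not part of HiCAT.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {record_name: sequence}.

    The record name is the first whitespace-delimited token after '>'.
    """
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def extract_region(sequence: str, start: int, end: int) -> str:
    """Return sequence[start..end], with 1-based inclusive coordinates."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= length")
    return sequence[start - 1:end]


def format_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one record as FASTA text with fixed-width sequence lines."""
    lines = [f">{name}"]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```

In practice one would open the downloaded FASTA, pass the file handle to `read_fasta`, call `extract_region` with the centromere coordinates, and write the `format_fasta` output to a new file to use as the `-i` input. Note that `read_fasta` holds whole sequences in memory, which is fine for a single chromosome but is a deliberate simplification.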

Henry-Ding commented 1 year ago

Hi, I am trying to use TRF. Can you share how you filter and select the TRF results? Looking forward to your answer. Best wishes, ding

865699871 commented 1 year ago

I don't know why you are using TRF. In the human genome, we used the active HOR regions defined in "Complete genomic and epigenetic maps of human centromeres".

865699871 commented 1 year ago

HiCAT can be used on any tandem repeat region, but it cannot decide which one is the centromere. That should be provided by the user.

Henry-Ding commented 1 year ago

Hi, thank you for your prompt reply. I am trying HiCAT on my own data. Should I use published centromeric sequences as template sequences, or the TRF results? I used lastz to align published centromere sequences against the chromosome sequences, and some chromosomes did not yield any alignments. If I use the TRF results, how do I confirm that a sequence is the centromere sequence I need? Looking forward to your answer. Best wishes, ding

865699871 commented 1 year ago

If previous studies have determined the centromere sequence, their results can be used. If not, I suggest you perform CENH3 ChIP-seq (CENP-A for human) to determine the functional centromere sequence. In most species, as far as I know, the centromere sequence is the largest tandem repeat sequence, but ChIP-seq is what determines the functional centromere sequence.
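The "largest tandem repeat" heuristic above can be sketched as a ranking of TRF hits by the length of sequence they cover. This assumes TRF was run so that it writes a `.dat` file whose data lines are whitespace-separated and begin with start, end, period size, and copy number; header lines are skipped. `Repeat`, `parse_trf_dat`, and `largest_repeats` are hypothetical names, and this is not the filtering the HiCAT authors used.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Repeat:
    start: int      # 1-based start of the repeat array
    end: int        # 1-based inclusive end of the repeat array
    period: int     # repeat unit size in bp
    copies: float   # number of copies of the unit

    @property
    def span(self) -> int:
        """Number of bases covered by the repeat array."""
        return self.end - self.start + 1


def parse_trf_dat(lines: Iterable[str]) -> List[Repeat]:
    """Collect repeat records from TRF .dat lines.

    Assumption: data lines start with four numeric fields
    (start, end, period, copy number). Anything else is skipped.
    """
    repeats = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            repeats.append(
                Repeat(int(fields[0]), int(fields[1]), int(fields[2]), float(fields[3]))
            )
        except ValueError:
            # Header or parameter line, not a repeat record.
            continue
    return repeats


def largest_repeats(repeats: Iterable[Repeat], top: int = 3) -> List[Repeat]:
    """Return the `top` repeats covering the most sequence, largest first."""
    return sorted(repeats, key=lambda r: r.span, reverse=True)[:top]
```

TRF can report the same locus several times with different period sizes, and this sketch does not merge such overlapping hits. As noted above, the largest array is only a candidate; ChIP-seq is still what identifies the functional centromere.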