zhoujt1994 / scHiCluster

MIT License

Gaussian filter memory error #19

Closed sshen82 closed 10 months ago

sshen82 commented 11 months ago

I want to run imputation on a 10kb dataset and tried the algorithm on a single cell. However, it fails with a memory error: "MemoryError: Unable to allocate 278. GiB for an array with shape (273217, 273217) and data type float32". Do you have a solution for this? Would it be wise to skip the "pad" parameter to avoid gaussian_filter?
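For reference, the size in the error message follows directly from the array shape and dtype. A minimal sketch of the arithmetic (the helper name `dense_gib` is hypothetical, used only for illustration):

```python
import numpy as np

def dense_gib(n_bins: int, dtype=np.float32) -> float:
    """Memory in GiB needed for a dense n_bins x n_bins array of `dtype`."""
    return n_bins * n_bins * np.dtype(dtype).itemsize / 2**30

# Shape taken from the error message: (273217, 273217), float32 (4 bytes per entry).
print(f"{dense_gib(273217):.1f} GiB")  # 278.1 GiB
```

Any step that materializes the full genome-wide matrix at this resolution hits the same limit, regardless of which function requests the allocation.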

sshen82 commented 11 months ago

Even if I set pad to zero, np.triu_indices gives another memory error.

zhoujt1994 commented 10 months ago

Do you need to do imputation on all-by-all chromosomes? We typically do it only within each chromosome, and when the resolution is high we only impute within a certain distance from the main diagonal. For example, when imputing at 10kb resolution, we only do it for contacts <5Mb apart. In general, I would not trust the imputation for trans or very long-distance contacts at high resolution. See https://zhoujt1994.github.io/scHiCluster/Tan2021/imputation.html
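To illustrate why restricting to a band near the diagonal helps, here is a minimal sketch that builds upper-triangular indices only for entries within a maximum offset, one diagonal at a time, so no full square mask is ever allocated (unlike np.triu_indices on the whole matrix). This is not scHiCluster's own code; the function name `band_triu_indices` and the 250 Mb chromosome length are assumptions chosen for illustration.

```python
import numpy as np

def band_triu_indices(n_bins: int, max_offset: int):
    """Row/col indices of entries with 0 <= col - row <= max_offset.

    Indices are built diagonal by diagonal, so memory scales with
    n_bins * max_offset rather than n_bins ** 2.
    """
    rows, cols = [], []
    for k in range(min(max_offset, n_bins - 1) + 1):
        r = np.arange(n_bins - k, dtype=np.int64)  # rows on the k-th diagonal
        rows.append(r)
        cols.append(r + k)
    return np.concatenate(rows), np.concatenate(cols)

# Hypothetical example: a 250 Mb chromosome at 10 kb -> 25,000 bins;
# a 5 Mb distance cutoff -> 500 bins.
rows, cols = band_triu_indices(25_000, 500)
print(rows.size)  # 12399750 entries, versus 625,000,000 in the full square
```

The returned order is diagonal-major rather than the row-major order of np.triu_indices, which matters only if downstream code relies on ordering.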

sshen82 commented 10 months ago

I agree with that, and I switched to imputing chromosome by chromosome (cis only). However, the problem now is that the resulting matrices are pretty dense even after I directly filtered out interactions >2Mb. I have to run the jobs in separate batches (about 8000 cells at a time). It's also pretty slow: I have 50000 cells, and in one night only several hundred cells finish. I guess there isn't anything you can do right now, since I am in the realm of 10kb, where everything is slow and huge. Thank you for your advice though!

zhoujt1994 commented 10 months ago

In our setting, we run the imputation on the Anvil supercomputer cluster, where each node has 128 CPUs, and it takes <8h to finish imputing 1536 cells at 10kb resolution for <5Mb. If you can use 40 such nodes in total, you can finish 60k cells overnight. If you restrict it to <2Mb, it will be even faster. We now work on 50-200k cells routinely, so I don't think this is a huge problem. You can see our parameter settings in the link I sent above. Storage could be an issue, as you mentioned. I guess you need to prepare ~10TB of disk to store the cool files for 50k cells.
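The throughput and storage figures above can be turned into a quick planning estimate. A rough sketch, assuming throughput scales linearly with node count and reading the storage figure as 10 TB in decimal units (the helper `nodes_needed` is hypothetical):

```python
import math

CELLS_PER_NODE_RUN = 1536  # cells per 128-CPU node in one <8 h run, from the comment above

def nodes_needed(n_cells: int, cells_per_node_run: int = CELLS_PER_NODE_RUN) -> int:
    """Nodes required to finish n_cells in a single run, assuming linear scaling."""
    return math.ceil(n_cells / cells_per_node_run)

print(40 * CELLS_PER_NODE_RUN)  # 61440 cells on 40 nodes, i.e. roughly 60k
print(nodes_needed(50_000))     # 33 nodes for 50k cells

# Storage: ~10 TB for 50k cells works out to about 200 MB per cell.
per_cell_mb = 10e12 / 50_000 / 1e6
print(per_cell_mb)              # 200.0
```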

sshen82 commented 10 months ago

Well, our university is poorer and the computational resources are limited, so even if I am using the high throughput computing system, it isn't as fast as yours :)

zhoujt1994 commented 10 months ago

Yes, I agree. Sorry about that. In that case, you could try changing output_dist, window_size, and step_size, which could make it faster. For example, output_dist=2000000, window_size=6000000, step_size=2000000 in your case.
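To get a feel for what these values imply, here is a rough sketch under the assumption that window_size and step_size describe windows sliding along the diagonal of each chromosome, each handled as a small dense block. That reading of the parameters, the `sliding_windows` helper, and the 250 Mb chromosome length are illustrative assumptions, not a description of scHiCluster's internals.

```python
def sliding_windows(chrom_len: int, window_size: int, step_size: int):
    """Yield (start, end) windows in bp that cover [0, chrom_len)."""
    start = 0
    while True:
        end = min(start + window_size, chrom_len)
        yield start, end
        if end >= chrom_len:
            break
        start += step_size

resolution = 10_000        # 10 kb bins
window_size = 6_000_000    # value suggested above
step_size = 2_000_000      # value suggested above
chrom_len = 250_000_000    # hypothetical chromosome length

windows = list(sliding_windows(chrom_len, window_size, step_size))
bins_per_window = window_size // resolution
dense_mb = bins_per_window**2 * 4 / 1e6  # one float32 block per window

print(len(windows))      # 123 windows
print(bins_per_window)   # 600 bins per side
print(dense_mb)          # 1.44 MB per dense window
```

Under this assumption, each window is a 600 x 600 block of about 1.4 MB, which is why smaller windows keep both memory and per-cell runtime manageable.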

sshen82 commented 10 months ago

I see, that sounds great. Thank you for the advice!

