zstephens / neat-genreads

NEAT read simulation tools

Memory error (coverage/threads optimization) #10

Closed muppetjones closed 8 years ago

muppetjones commented 8 years ago

I've generally been setting the number of threads equal to the coverage, but if I set the coverage too high (200x), this results in MemoryErrors.

I need about 100x coverage (a single sample with -c 100 nets me about 30-40x). I've been generating two sets of samples at 100x and combining them, but I'm not sure what sorts of side effects this may have, plus it takes 2x the time.
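Roughly, the two-run workaround looks like this (flags follow the genReads.py README; the --rng seed flag and the <prefix>_read1.fq output naming are assumptions worth checking against genReads.py -h):

```bash
# Simulate twice at 100x with different RNG seeds, then pool the reads.
# --rng and the _read1.fq naming convention are assumptions; verify locally.
python genReads.py -r ref.fa -R 101 -c 100 -o runA --rng 111
python genReads.py -r ref.fa -R 101 -c 100 -o runB --rng 222
cat runA_read1.fq runB_read1.fq > combined_read1.fq
```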

Do you have an estimate of the optimal coverage / thread settings?

zstephens commented 8 years ago

On Oct 17, 2016, at 12:01 PM, Stephen J Bush notifications@github.com wrote:

Makes sense. I wonder if the memory errors (separate issue) are due to parsing the reference or a high coverage. If it's the reference, I think the only way to fix it would be to pause all jobs while one calculates the regions and writes them to a file, then have each job read the regions...which would be a headache.

That was my first guess as well: that each separate job is essentially making its own internal copy of the reference (~3GB). It would be possible to toss the regions not needed for the current job, but unfortunately the way it currently works is as follows:

1) Read in the entire reference and split it into isolated regions (by Ns, chromosome boundaries, etc.).
2) Based on the job id and the total number of jobs, divide up these regions and identify which ones are relevant for the current job.
3) Sample reads from the assigned regions.
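To make step 2 concrete, here is a toy sketch assuming a simple round-robin split by region index; the actual partitioning logic in genReads.py may well differ:

```bash
# Toy illustration of step 2: divide R regions among TOTAL_JOBS jobs.
# Round-robin by region index is an assumed scheme, not necessarily NEAT's.
JOB_ID=2; TOTAL_JOBS=4; R=10   # example values; JOB_ID is 1-based
for (( r = 0; r < R; r++ )); do
    if (( r % TOTAL_JOBS == JOB_ID - 1 )); then
        echo "job ${JOB_ID} samples reads from region ${r}"
    fi
done
```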

Even if we could toss out the unneeded reference sequence after step 2, there’s still the risk of memory errors when all the jobs are trying to do steps 1-2 simultaneously, so I’d have to rethink how that process is implemented to get around this.
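One minimal sketch of the "compute once, share via a file" idea from the quoted message would be a lock file guarded with flock; note that compute_regions below is a hypothetical placeholder, not an existing NEAT script:

```bash
# Sketch of the shared-regions idea: the first job to take the lock computes
# the region list once; every other job blocks on the lock, then reads the
# finished file. compute_regions is a hypothetical placeholder command.
(
    flock 9                                   # blocks until the lock is free
    [[ -s regions.txt ]] || compute_regions ref.fa > regions.txt
) 9> regions.lock
# ...each job now reads regions.txt instead of re-parsing the ~3GB reference
```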

I'm copying Luda on this email, who might have more insight regarding NEAT performance when it comes to thread number. Aside from the kludgy parallel-job implementation, NEAT admittedly isn't particularly optimized for performance, i.e., there aren't any intentionally parallelized chunks of time-intensive code, or things like that.

-Zach


muppetjones commented 8 years ago

For some reason, your message did not show up on the other issue, so I'm reopening this one temporarily. I think fixing this would require a fairly large overhaul, which probably isn't worth it at the moment.

muppetjones commented 8 years ago

Running NEAT at higher coverage levels works if I use fewer threads. Specifically, I set the thread count to the number of processors: `THREADS=${1:-$(getconf _NPROCESSORS_ONLN)}`
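For completeness, a sketch of how that setting feeds into launching the jobs; the --job <id> <total> flags follow the genReads.py README's parallel-job convention, but treat the exact semantics as an assumption and check genReads.py -h:

```bash
# Launch one NEAT job per processor rather than one per unit of coverage.
# --job <id> <total> is genReads.py's parallel-job option (verify with -h);
# the per-job outputs still need merging afterwards (e.g., with the repo's
# mergeJobs.py, if I recall the layout correctly).
THREADS=${1:-$(getconf _NPROCESSORS_ONLN)}
for (( i = 1; i <= THREADS; i++ )); do
    python genReads.py -r ref.fa -R 101 -c 100 -o sim --job "${i}" "${THREADS}" &
done
wait
```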