Instances become idle or crash

xniu7 / jhuclass.genome.indexing

1 stars 2 forks source link

Instances become idle or crash #2

Open xniu7 opened 10 years ago

xniu7 commented 10 years ago

When the RDD of a single machine is larger than its memory, the instance will become idle or crash, but other instances may still work. This error is hard to be detected because out-of-memory message is not always printed.

xniu7 commented 10 years ago

When I use m2.2 xlarge (300GB total memory) instead of m1.xlarge (150GB total memory), part of tasks run success (23 success, 17 idle). Peak memory usage is (250GB) when dealing with ERR194174_100M data with prefix CC. One possible reason is that memory usage among tasks are different.

xniu7 commented 10 years ago

If we increase the number of tasks 40->80, which means average task memory usage is changed from 6GB->3GB. In this way, any 4 tasks running in a machine (4 cores) memory usage is change from 24GB->12 GB. In this way, we can avoid instance crash when 4 big tasks run in the same instance together. However, sortByKey time is increase from 4min->8min. Thus, this is a tradeoff between speed and space.

xniu7 commented 10 years ago

Two important factors:

The overall peak memory usage of the entire cluster. As for 100M reads test with two char prefixes such as "CC", we need a cluster with more than 250GB. The cluster of 10 m2.2 xlarge instances is acceptable.
Threads number configuration. Breaking a big task into several small tasks will cost more time, but make tasks work successful in a limited memory space.