yinlou / mltk

Machine Learning Tool Kit
BSD 3-Clause "New" or "Revised" License

Regression trees cause GC churn #16

Closed by lukehutch 8 years ago

lukehutch commented 8 years ago

The regression tree methods in MLTK allocate and discard a very large number of objects, which causes GC churn and puts heavy strain on the VM. A large share of the run time is spent in garbage collection, and the impact is even worse when running several regressions in parallel, since the JVM doesn't do a good job of concurrent garbage collection. (The scalability issue alone will probably mean I can't use MLTK for my task.)

An object instance recycling scheme would help immensely with this problem.
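To make the suggestion concrete, here is a minimal sketch of what such a recycling scheme could look like for scratch arrays. `DoubleArrayPool` and its methods are hypothetical names used only for illustration; nothing like this exists in MLTK as far as this thread shows, and the sketch assumes the churn comes from short-lived `double[]` buffers of a few recurring sizes.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of MLTK: recycles double[] scratch buffers by length. */
final class DoubleArrayPool {
    // Free buffers grouped by length. Not thread-safe: use one pool per worker thread.
    private final Map<Integer, ArrayDeque<double[]>> free = new HashMap<>();
    private long allocations = 0;

    /** Returns a zeroed buffer of the requested length, reusing a released one if possible. */
    double[] acquire(int length) {
        ArrayDeque<double[]> stack = free.get(length);
        if (stack != null && !stack.isEmpty()) {
            double[] buf = stack.pop();
            Arrays.fill(buf, 0.0); // clear stale values left by the previous user
            return buf;
        }
        allocations++;
        return new double[length];
    }

    /** Hands a buffer back to the pool; the caller must not touch it afterwards. */
    void release(double[] buf) {
        free.computeIfAbsent(buf.length, k -> new ArrayDeque<>()).push(buf);
    }

    /** Number of buffers that had to be freshly allocated. */
    long allocations() {
        return allocations;
    }

    public static void main(String[] args) {
        DoubleArrayPool pool = new DoubleArrayPool();
        // Simulate 1000 boosting rounds that each need one residual buffer.
        for (int round = 0; round < 1000; round++) {
            double[] residuals = pool.acquire(124_000);
            residuals[0] = round;
            pool.release(residuals);
        }
        System.out.println("fresh allocations: " + pool.allocations());
    }
}
```

In this toy loop only the first round allocates; every later round reuses the same array. Keeping one pool per thread avoids locking, at the cost of holding one set of buffers per worker.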

yinlou commented 8 years ago

Which learner are you using? How large is your dataset? How much memory did you allocate?

lukehutch commented 8 years ago

I tried using RegressionTreeLearner with both LSBoostLearner and LADBoostLearner; both have the same problem.

I have up to about 124,000 training examples (with about 400 dimensions) and 10,000 test examples.

The memory used by a single thread is on the order of 4-8 GB, but it fluctuates by about 1-2 GB every several seconds (the downward swings are due to garbage collection).

Viewing CPU usage while running several regressors in different threads shows all the cores fluctuating between about 20% and 60%, with major churn. This is indicative of heavy GC activity. By comparison, CPU-bound multithreaded Java programs that allocate no new objects typically keep all the cores busy at 100%.
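CPU graphs are only indirect evidence. One way to put a number on the GC share is the standard `java.lang.management` API. The sketch below is illustrative: `GcOverhead` and the allocation-heavy toy workload are made up for this example and stand in for a real training run. The reported collection time is approximate and depends on the collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Illustrative helper: estimates how much wall-clock time a task spends in GC. */
final class GcOverhead {
    /** Sums the accumulated collection time (ms) over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 means the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Runs the task and returns GC time divided by wall-clock time for that run. */
    static double gcFraction(Runnable task) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        task.run();
        long wallMillis = Math.max(1, (System.nanoTime() - start) / 1_000_000);
        return (totalGcMillis() - gcBefore) / (double) wallMillis;
    }

    public static void main(String[] args) {
        double fraction = gcFraction(() -> {
            // Toy stand-in for a learner that allocates a fresh buffer per step.
            double sum = 0;
            for (int i = 0; i < 20_000; i++) {
                double[] scratch = new double[10_000];
                scratch[i % scratch.length] = i;
                sum += scratch[i % scratch.length];
            }
            System.out.println("checksum: " + sum);
        });
        System.out.printf("GC share of wall time: %.1f%%%n", 100 * fraction);
        System.out.println("max heap (MB): " + Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}
```

Wrapping a real training call in `gcFraction` and comparing the result across heap sizes (set with `-Xmx`) would show whether the slowdown comes from collection or from something else, such as lock contention.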

yinlou commented 8 years ago

I just made some edits to save memory. Let me know if that helps. Thank you!