yandex / rep

Machine Learning toolbox for Humans
http://yandex.github.io/rep/

Numexpr and multithread failures #15

Closed arogozhnikov closed 9 years ago

arogozhnikov commented 9 years ago

Training many models in threads while passing the columns argument causes a kernel restart.

After two hours of debugging we found that the cause is numexpr, which does not work reliably in threads. Minimal failing example:

import numpy
import pandas
from rep.metaml.utils import map_on_cluster
from rep.utils import get_columns_in_df

columns = ['Feature_0', 'Feature_1']
size = 10000
x = pandas.DataFrame(numpy.random.random([size, 2]), columns=columns)

def f(x, columns):
    # Select the requested columns from a copy of the frame
    x = get_columns_in_df(x.copy(), columns)
    return x

n = 50
# Run f on 50 copies of the data using 4 threads
result = map_on_cluster('threads-4', f, [x] * n, [columns[:1]] * n)
print('ok')

At the very least, this means we should minimize our use of numexpr, if not drop it entirely.
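One way to reduce reliance on numexpr is to look up plain column names directly and reserve expression evaluation for names that are not columns. The sketch below only illustrates that idea: select_columns is a hypothetical helper, not REP's get_columns_in_df, and multiprocessing.pool.ThreadPool stands in for map_on_cluster.

```python
from multiprocessing.pool import ThreadPool

import numpy
import pandas


def select_columns(df, columns):
    """Hypothetical helper (not part of REP): plain column names are looked up
    directly; only names that are not columns are treated as expressions."""
    result = pandas.DataFrame(index=df.index)
    for name in columns:
        if name in df.columns:
            # Plain lookup, no expression engine involved
            result[name] = df[name]
        else:
            # Expression path: still goes through DataFrame.eval
            result[name] = df.eval(name)
    return result


df = pandas.DataFrame(numpy.random.random([1000, 2]),
                      columns=['Feature_0', 'Feature_1'])

# Plain column selection from 4 threads never reaches the expression engine
with ThreadPool(4) as pool:
    frames = pool.map(lambda frame: select_columns(frame, ['Feature_0']),
                      [df] * 50)
```

With this split, only features written as expressions still depend on the evaluation engine.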

arogozhnikov commented 9 years ago

Partially fixed this in https://github.com/yandex/rep/commit/cbd019a83acd5aa7bf6338e863226a97f0abd972

But there will still be problems with features-as-expressions (unavoidable at the moment).

arogozhnikov commented 9 years ago

Fixed in https://github.com/yandex/rep/commit/6cd44f649693fe3486c0574623305964d4bb44bc

I replaced numexpr with pandas.eval(engine='python').

Some drawbacks of this solution:

  1. it is slower
  2. no support for exp, log, sin, etc.

But Python evaluation seems to be more reliable.
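For illustration, here is a minimal sketch of evaluating a feature expression with the Python engine. It uses DataFrame.eval, which forwards engine to pandas.eval. The data and the expression are made up for the example. Applying numpy.log outside the expression is one possible workaround for the missing functions, not something taken from the commit.

```python
import numpy
import pandas

df = pandas.DataFrame({'Feature_0': [1.0, 2.0, 3.0],
                       'Feature_1': [10.0, 20.0, 30.0]})

# Arithmetic expression evaluated by the pure-Python engine instead of numexpr
combined = df.eval('Feature_0 + 2 * Feature_1', engine='python')

# Possible workaround (an assumption, not from the commit): apply functions
# such as log with numpy outside the expression string
log_feature = numpy.log(df['Feature_0'])
```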