xtensor-stack / xtensor-python

Python bindings for xtensor
BSD 3-Clause "New" or "Revised" License

slow iteration #88

Closed nbecker closed 7 years ago

nbecker commented 7 years ago

Test files are here: https://gist.github.com/nbecker/13c6ca2869d8b5266b6a1e5ae64e9806

This is not much different from the previous benchmark. The actual benchmark is at the end of test_logsumexp.py.

nbecker commented 7 years ago

I seem to have broken something; now it segfaults.

JohanMabille commented 7 years ago

Correct me if I'm wrong, but I guess that at line 35 you want to take all elements along the i-th axis if it is not present in the axes parameter?

In that case, the correct code would be

    if (std::find(axes.begin(), axes.end(), i) != axes.end())
        sv.push_back(xt::newaxis());
    else
        sv.push_back(xt::all());
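
For context, here is a minimal sketch of the surrounding loop (hypothetical, reconstructed from this discussion rather than copied from the gist; e1, max, and axes are the names used in this thread, and the slice_vector/dynamic_view calls follow the xtensor API of that era, whose exact construction may differ in current versions). It rebuilds every reduced dimension of max as a size-1 axis so the result broadcasts against e1, like NumPy's keepdims=True:

    xt::slice_vector sv;
    for (std::size_t i = 0; i < e1.dimension(); ++i)
    {
        if (std::find(axes.begin(), axes.end(), i) != axes.end())
            sv.push_back(xt::newaxis());  // axis i was reduced away: reinsert it with size 1
        else
            sv.push_back(xt::all());      // axis i survives the reduction: take all elements
    }
    auto max2 = xt::dynamic_view(max, sv);  // max2 now broadcasts against e1
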
nbecker commented 7 years ago

After fixing that (thanks!) and retesting with:

    from timeit import timeit
    import numpy as np

    u = np.ones((10, 100000))
    v = np.ones((100000, 10))
    for x in ('u', 'v'):
        for l in ('logsumexp', 'logsumexp2'):
            print(l + '(%s, 1,))' % x, timeit(l + '(%s, (1,))' % x, 'from __main__ import %s, %s' % (l, x), number=100))
            print(l + '(%s, 0,))' % x, timeit(l + '(%s, (0,))' % x, 'from __main__ import %s, %s' % (l, x), number=100))

I get

logsumexp(u, 1,)) 16.155551936000023
logsumexp(u, 0,)) 2.8450079509999853
logsumexp2(u, 1,)) 1.2417155439999874
logsumexp2(u, 0,)) 1.5459593610000013
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  SystemError: <built-in method __contains__ of dict object at 0x7f205b71af48> returned a result with an error set
Aborted (core dumped)

So there are two issues:

  1. The benchmarks of the xtensor version are not very good.
  2. I don't know what is causing the error.
JohanMabille commented 7 years ago

The error is due to a huge memory allocation attempt (100k x 100k) when calling logsumexp(v, 1). Here is a decomposition of the last lines of logsumexp, with the temporaries and their shapes:

    auto max2 = xt::dynamic_view(max, sv);              // shape = {100000, 1}
    xt::pyarray<value_type> tmp1 = xt::exp(e1 - max2);  // shape = {100000, 10}
    xt::pyarray<value_type> tmp2 = xt::sum(tmp1, axes); // shape = {100000}
    xt::pyarray<value_type> tmp3 = xt::log(tmp2);       // shape = {100000}
    xt::pyarray<value_type> res = max2 + tmp3;          // shape = {100000, 100000}

The shape of res is the result of broadcasting {100000, 1} with {100000}.

Replacing max2 with max in the last line should fix this problem.
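
To make the broadcasting rule concrete, here is a tiny standalone illustration (shapes shrunk from 100000 to 3 so it actually runs, and plain xt::xarray instead of xt::pyarray): a {3, 1} operand and a {3} operand align on the trailing axis, so their sum has shape {3, 3}, which is exactly the blow-up that {100000, 1} + {100000} produces at scale.

    #include <iostream>
    #include "xtensor/xarray.hpp"
    #include "xtensor/xio.hpp"

    int main()
    {
        xt::xarray<double> a = {{1.0}, {2.0}, {3.0}};  // shape {3, 1}
        xt::xarray<double> b = {10.0, 20.0, 30.0};     // shape {3}
        xt::xarray<double> c = a + b;                  // broadcasts to shape {3, 3}
        std::cout << c << std::endl;                   // prints a 3x3 matrix of pairwise sums
        return 0;
    }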

nbecker commented 7 years ago

I'm confused. The purpose of introducing max2 was that max can't be used: in general it won't broadcast against e1. numpy's amax has a keepdims keyword, which is what I'd need here (but xt::amax lacks it), so I had to simulate it by introducing max2, which puts back the missing dimensions (with size == 1).

That large allocation isn't supposed to happen. Am I doing this all wrong? The whole objective here was to write the equivalent of scipy.misc.logsumexp https://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.misc.logsumexp.html

JohanMabille commented 7 years ago

Your usage of max2 to compute exp(e1 - max2) is correct, and you're right about this: in the general case, e1 and max won't broadcast. Indeed, max is a reduction of e1 across the specified axes, so max2 is still required to compute tmp1. At this stage of the computation, tmp1 and e1 have the same shape, that is, {100000, 10}.

Now, when you compute tmp2, you perform the same type of reduction (i.e. across the same axes) on tmp1 that you did on e1 to get max. That means tmp2 and max have the same shape, i.e. {100000}. Then tmp3 has the same shape as tmp2, since no broadcasting operation is performed.

At this stage, we have tmp3 with shape {100000}, max with shape {100000}, and max2 with shape {100000, 1}.

Both tmp3 + max2 and tmp3 + max are valid expressions; however, the first ends up as a 2-D expression with shape {100000, 100000}, while the second ends up as a 1-D expression with shape {100000}, which is what you want.
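
Put together, the corrected tail of the function would look like this (a sketch assembled from the decomposition above, not the gist's literal code):

    xt::pyarray<value_type> tmp1 = xt::exp(e1 - max2);  // the keepdims view max2 is needed here
    xt::pyarray<value_type> tmp2 = xt::sum(tmp1, axes); // shape = {100000}, same as max
    xt::pyarray<value_type> res = max + xt::log(tmp2);  // add max, not max2: shape stays {100000}
    return res;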

JohanMabille commented 7 years ago

Here are the results of the benchmarks (not the same number of iterations) on my laptop:

logsumexp(u, 1,)) 2.8467595061728397
logsumexp(u, 0,)) 3.709872592592592
logsumexp2(u, 1,)) 1.3747942716049382
logsumexp2(u, 0,)) 1.5532693333333336
logsumexp(v, 1,)) 3.6790692345679012
logsumexp(v, 0,)) 3.1909791604938285
logsumexp2(v, 1,)) 2.092781827160497
logsumexp2(v, 0,)) 1.6343920987654315

I'll start investigating the performance issue.

nbecker commented 7 years ago

That's much better! Here are my results:

logsumexp(u, 0,)) 2.7423938450001515
logsumexp2(u, 0,)) 1.568801411000095
logsumexp(u, 1,)) 2.0876657210001213
logsumexp2(u, 1,)) 1.3879509569997026
logsumexp(v, 0,)) 2.3853387870003644
logsumexp2(v, 0,)) 1.608310051999979
logsumexp(v, 1,)) 2.6268803089997164
logsumexp2(v, 1,)) 2.058452197999941

nbecker commented 7 years ago

I reordered the tests for easier comparison:

    from timeit import timeit
    import numpy as np

    u = np.ones((10, 100000))
    v = np.ones((100000, 10))
    for x in ('u', 'v'):
        for axis in (0, 1):
            for l in ('logsumexp', 'logsumexp2'):
                print(l + '(%s, %s,))' % (x, axis), timeit(l + '(%s, (%s,))' % (x, axis), 'from __main__ import %s, %s' % (l, x), number=100))
nbecker commented 7 years ago

I've also been trying to learn how to do this in Julia, and have some interesting results: https://discourse.julialang.org/t/apply-reduction-along-specific-axes/3301/15?u

JohanMabille commented 7 years ago

Indeed the implementation in Julia seems really performant.

I've worked on the reducers; could you try the performance branch of my fork? On my side, I have the following numbers:

master branch:

logsumexp(u, 0,)) 3.709872592592592
logsumexp2(u, 0,)) 1.5532693333333336
logsumexp(u, 1,)) 2.8467595061728397
logsumexp2(u, 1,)) 1.3747942716049382
logsumexp(v, 0,)) 3.1909791604938285
logsumexp2(v, 0,)) 1.6343920987654315
logsumexp(v, 1,)) 3.6790692345679012
logsumexp2(v, 1,)) 2.092781827160497

performance branch:

logsumexp(u, 0,)) 3.015181827160494
logsumexp2(u, 0,)) 1.4681777777777776
logsumexp(u, 1,)) 2.451084641975309
logsumexp2(u, 1,)) 1.3026453333333334
logsumexp(v, 0,)) 2.7342716049382716
logsumexp2(v, 0,)) 1.7015956543209878
logsumexp(v, 1,)) 3.1623810370370364
logsumexp2(v, 1,)) 2.248043851851852

We're still slower, but this seems to be going in the right direction.

nbecker commented 7 years ago

Sorry, my git-fu isn't up to it. I tried:

    git clone https://github.com/QuantStack/xtensor.git xtensor.jm
    Cloning into 'xtensor.jm'...
    remote: Counting objects: 4504, done.
    remote: Compressing objects: 100% (47/47), done.
    remote: Total 4504 (delta 19), reused 0 (delta 0), pack-reused 4451
    Receiving objects: 100% (4504/4504), 1.58 MiB | 0 bytes/s, done.
    Resolving deltas: 100% (3096/3096), done.
    Checking connectivity... done.
    [nbecker@nbecker2 ~]$ cd xtensor.jm
    [nbecker@nbecker2 xtensor.jm]$ git fetch https://github.com/JohanMabille/xtensor.git
    From https://github.com/JohanMabille/xtensor

But I don't see any sign of a performance branch, so I guess this wasn't correct.

JohanMabille commented 7 years ago

If you already cloned quantstack/xtensor, you don't have to clone my fork; you can just add my repo as a new remote and check out my branch:

cd xtensor
git remote add jm https://github.com/JohanMabille/xtensor.git
git fetch jm
git checkout performance

After that, you can run git branch or gitk to check that you are on the correct branch.

nbecker commented 7 years ago

I rebased your performance changes to xtensor onto master, and they applied cleanly. But my code doesn't compile:

    /home/nbecker/.local/include/xtensor-python/pycontainer.hpp:80:78: error:
    'typename xt::pycontainer<xt::pyarray<double> >::iterable_base {aka
    xt::xiterable<xt::pyarray<double> >}::broadcast_iterator' names
    'template<enum class xt::layout_type L> using broadcast_iterator =
    xt::xconst_iterable<xt::pyarray<double> >::broadcast_iterator<L>', which is
    not a type
             using broadcast_iterator = typename iterable_base::broadcast_iterator;

JohanMabille commented 7 years ago

You need to update xtensor-python; it looks like your version is not 0.11.1.

nbecker commented 7 years ago

OK, but I needed:

    *** /home/nbecker/xtensor-python.hg/include/xtensor-python/pycontainer.hpp  2017-05-05 07:47:15.893865148 -0400
    --- /home/nbecker/.local/include/xtensor-python/pycontainer.hpp  2017-05-05 07:54:52.351386557 -0400
    ***************
    *** 161,167 ****

          template <class D>
          inline pycontainer<D>::pycontainer(const pybind11::object& o)
    !         : pybind11::object(raw_array_t(o.ptr()), pybind11::object::stolen)
          {
              if (!this->m_ptr)
              {
    --- 161,167 ----

          template <class D>
          inline pycontainer<D>::pycontainer(const pybind11::object& o)
    !         : pybind11::object(raw_array_t(o.ptr()), stolen_t{})
          {
              if (!this->m_ptr)
              {

nbecker commented 7 years ago

And I can't run the logsumexp test; I get a segfault on the simplest test:

    $ python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from xtensor_test.logsumexp import logsumexp
    >>> import numpy as np
    >>> u = np.ones(4)
    >>> v = logsumexp(u)
    Segmentation fault (core dumped)

JohanMabille commented 7 years ago

Problem fixed on the performance branch; the fix will be merged into master soon.

JohanMabille commented 7 years ago

Performance has been improved in xtensor 0.10.1. Thanks again for reporting!