wesm / pandas2

Design documents and code for the pandas 2.0 effort.
https://pandas-dev.github.io/pandas2/

Improving groupby-apply microperformance #8

Open wesm opened 8 years ago

wesm commented 8 years ago

Consider the case of a DataFrame with a large number of distinct groups:

import numpy as np
import pandas as pd

arr = np.random.randn(5000000)
df = pd.DataFrame({'group': arr.astype('str').repeat(2)})
df['values'] = np.random.randn(len(df))
df.groupby('group').apply(lambda g: len(g))

I get:

In [17]: %time result = df.groupby('group').apply(lambda g: len(g))
CPU times: user 6.45 s, sys: 68 ms, total: 6.52 s
Wall time: 6.51 s

The per-group overhead is fairly fixed -- with 5 million groups we have:

In [22]: %time result = df.groupby('group').apply(lambda g: len(g))
CPU times: user 31 s, sys: 108 ms, total: 31.1 s
Wall time: 31.1 s
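The ~6 microsecond figure below follows directly from dividing the wall time by the number of groups:

```python
# Back-of-the-envelope per-group overhead from the timing above.
groups = 5_000_000
wall_time_s = 31.1  # wall time from %time, in seconds

per_group_us = wall_time_s / groups * 1e6  # convert seconds -> microseconds
print(round(per_group_us, 1))  # ~6.2 microseconds per group
```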

It would be interesting to see whether, by pushing the disassembly-reassembly of DataFrame objects down into C++, we could take the per-group overhead from the current ~6 microseconds to under a microsecond or even less.

Note that the effects of bad memory locality are also a factor. We could look into tricks like using a background thread which "prepares" groups (up to a certain size / buffer threshold) while user apply functions are executing, to at least mitigate the time aspect of the groupby evaluation.
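The background-thread idea could be prototyped in pure Python along these lines. This is only a sketch: `pipelined_groupby_apply` is a hypothetical helper, not a pandas API, and in CPython the GIL limits how much real overlap the producer thread can achieve; it mainly illustrates the bounded-buffer structure.

```python
import queue
import threading

import numpy as np
import pandas as pd


def pipelined_groupby_apply(df, key, func, buffer_size=64):
    """Sketch: a producer thread materializes group slices into a bounded
    queue while the main thread runs the user function on each group,
    so group "preparation" can overlap with user apply code.

    Hypothetical helper for illustration only.
    """
    q = queue.Queue(maxsize=buffer_size)  # the size/buffer threshold
    sentinel = object()

    def producer():
        # Iterating the groupby does the disassembly into per-group frames.
        for name, group in df.groupby(key):
            q.put((name, group))
        q.put(sentinel)

    t = threading.Thread(target=producer, daemon=True)
    t.start()

    results = {}
    while True:
        item = q.get()
        if item is sentinel:
            break
        name, group = item
        results[name] = func(group)
    t.join()
    return pd.Series(results)


df = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'c'],
                   'values': np.random.randn(5)})
out = pipelined_groupby_apply(df, 'group', len)
```

A C++ implementation could do the slicing on a real background thread without the GIL, which is where the latency-hiding would actually pay off.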