Using pandas v1.1.0.
The pandas documentation has a good example of how to use numba to speed up a rolling.apply() operation:
import pandas as pd
import numpy as np
def mad(x):
    return np.fabs(x - x.mean()).mean()
df = pd.DataFrame({"A": np.random.randn(100_000)},
                  index=pd.date_range('1/1/2000', periods=100_000, freq='T')
).cumsum()
df.rolling(10).apply(mad, engine="numba", raw=True)
I would like to adapt it so that it works with a groupby operation.
df['day'] = df.index.day
df.groupby('day').agg(mad)
works fine.
df.groupby('day').agg(mad, engine='numba')
fails with the following error:
---------------------------------------------------------------------------
NumbaUtilError Traceback (most recent call last)
<ipython-input-21-ee23f1eec685> in <module>
----> 1 df.groupby('day').agg(mad, engine='numba')
~\AppData\Local\Continuum\anaconda3\envs\ds-cit-dev\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
940 if maybe_use_numba(engine):
--> 941 return self._python_agg_general(
942 func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
943 )
~\AppData\Local\Continuum\anaconda3\envs\ds-cit-dev\lib\site-packages\pandas\core\groupby\groupby.py in _python_agg_general(self, func, engine, engine_kwargs, *args, **kwargs)
1069 if maybe_use_numba(engine):
-> 1070 result, counts = self.grouper.agg_series(
1071 obj,
1072 func,
~\AppData\Local\Continuum\anaconda3\envs\ds-cit-dev\lib\site-packages\pandas\core\groupby\ops.py in agg_series(self, obj, func, engine, engine_kwargs, *args, **kwargs)
624 if maybe_use_numba(engine):
--> 625 return self._aggregate_series_pure_python(
626 obj, func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
627 )
~\AppData\Local\Continuum\anaconda3\envs\ds-cit-dev\lib\site-packages\pandas\core\groupby\ops.py in _aggregate_series_pure_python(self, obj, func, engine, engine_kwargs, *args, **kwargs)
682 if maybe_use_numba(engine):
--> 683 numba_func, cache_key = generate_numba_func(
684 func, engine_kwargs, kwargs, "groupby_agg"
685 )
~\AppData\Local\Continuum\anaconda3\envs\ds-cit-dev\lib\site-packages\pandas\core\util\numba_.py in generate_numba_func(func, engine_kwargs, kwargs, cache_key_str)
215 nopython, nogil, parallel = get_jit_arguments(engine_kwargs)
216 check_kwargs_and_nopython(kwargs, nopython)
--> 217 validate_udf(func)
218 cache_key = (func, cache_key_str)
219 numba_func = NUMBA_FUNC_CACHE.get(
~\AppData\Local\Continuum\anaconda3\envs\ds-cit-dev\lib\site-packages\pandas\core\util\numba_.py in validate_udf(func)
177 or udf_signature[:min_number_args] != expected_args
178 ):
--> 179 raise NumbaUtilError(
180 f"The first {min_number_args} arguments to {func.__name__} must be "
181 f"{expected_args}"
NumbaUtilError: The first 2 arguments to mad must be ['values', 'index']
My guess is that with engine='numba' the function is expected to take slightly different inputs.
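Going by the error message, one thing to try is a version of mad whose first two parameters are named values and index. The following is only a minimal sketch, assuming that engine='numba' hands each group's data to the function as a NumPy array (values) together with that group's index (index). The smaller frame, the fixed seed, the 'min' frequency alias, the selection of column A, and the pure-Python reference computation are my own additions for illustration, and the numba call is skipped if numba cannot be imported.

```python
import inspect

import numpy as np
import pandas as pd


def mad(values, index):
    # `values`: the group's data as a NumPy array (assumption based on the error text)
    # `index`: the group's index; required by the signature check but unused here
    return np.fabs(values - values.mean()).mean()


# The traceback shows the first two parameter names being compared to
# ['values', 'index']; inspect lets us look at the same names.
first_two = list(inspect.signature(mad).parameters)[:2]
print(first_two)

np.random.seed(0)
df = pd.DataFrame(
    {"A": np.random.randn(3_000)},
    index=pd.date_range("1/1/2000", periods=3_000, freq="min"),
).cumsum()
df["day"] = df.index.day

# Reference result without numba: call the same function by hand per group.
reference = df.groupby("day")["A"].agg(
    lambda s: mad(s.to_numpy(), s.index.to_numpy())
)

try:
    import numba  # noqa: F401  (only needed to confirm the engine is available)

    result = df.groupby("day")["A"].agg(mad, engine="numba")
except ImportError:
    # numba is not installed; fall back to the reference so the script still runs
    result = reference

print(result)
```

If the assumption holds, result and reference should agree for every day, which would suggest that the only change needed relative to the rolling example is the function signature.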