
Group by: split-apply-combine #

By “group by” we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria.

  • Applying a function to each group independently.

  • Combining the results into a data structure.

Out of these, the split step is the most straightforward. In the apply step, we might wish to do one of the following:

  • Aggregation: compute a summary statistic (or statistics) for each group. Some examples:

      • Compute group sums or means.

      • Compute group sizes / counts.

  • Transformation: perform some group-specific computations and return a like-indexed object. Some examples:

      • Standardize data (zscore) within a group.

      • Fill NAs within groups with a value derived from each group.

  • Filtration: discard some groups, according to a group-wise computation that evaluates to True or False. Some examples:

      • Discard data that belong to groups with only a few members.

      • Filter out data based on the group sum or mean.

    Many of these operations are defined on GroupBy objects. These operations are similar to those of the aggregating API, window API, and resample API.
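As a rough illustration of the three kinds of apply step, the sketch below runs one aggregation, one transformation, and one filtration on a small made-up frame. The column names `key` and `value` and the data are assumptions chosen only for this example.

```python
import pandas as pd

# Made-up data: group "a" has three rows, group "b" has one.
df = pd.DataFrame({"key": ["a", "a", "a", "b"], "value": [1.0, 2.0, 3.0, 10.0]})
grouped = df.groupby("key")["value"]

# Aggregation: one summary value per group.
means = grouped.mean()

# Transformation: result is indexed like the input; here, deviation from the group mean.
demeaned = df["value"] - grouped.transform("mean")

# Filtration: keep only rows that belong to groups with at least two members.
kept = df.groupby("key").filter(lambda g: len(g) >= 2)

print(means, demeaned, kept, sep="\n\n")
```

Note how the aggregation is indexed by group, the transformation keeps the original row index, and the filtration returns a subset of the original rows.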

    It is possible that a given operation does not fall into one of these categories or is some combination of them. In such a case, it may be possible to compute the operation using GroupBy’s apply method. This method will examine the results of the apply step and try to sensibly combine them into a single result if it doesn’t fit into any of the above three categories.

    An operation that is split into multiple steps using built-in GroupBy operations will be more efficient than using the apply method with a user-defined Python function.
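To make the comparison concrete, here is a minimal sketch that computes a per-group range in two ways: once with a user-defined function passed to apply, and once as a combination of built-in GroupBy reductions. The frame and its column names are made up for illustration.

```python
import pandas as pd

# Made-up data with two groups.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": [1.0, 4.0, 10.0, 12.0]})
grouped = df.groupby("key")["value"]

# One step, using a user-defined Python function with apply.
range_udf = grouped.apply(lambda x: x.max() - x.min())

# Two built-in reductions followed by a vectorized subtraction.
range_builtin = grouped.max() - grouped.min()

print(range_udf)
print(range_builtin)
```

Both approaches give the same values; the second avoids calling a Python function once per group.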

    The name GroupBy should be quite familiar to those who have used a SQL-based tool (or itertools), in which you can write code like:

    SELECT Column1, Column2, mean(Column3), sum(Column4)
    FROM SomeTable
    GROUP BY Column1, Column2
    

    We aim to make operations like this natural and easy to express using pandas. We’ll address each area of GroupBy functionality, then provide some non-trivial examples / use cases.
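One possible pandas counterpart of the query above is sketched here. `some_table` is a hypothetical stand-in for SomeTable with made-up values, and passing a dict to agg is only one of several ways to express it.

```python
import pandas as pd

# Hypothetical stand-in for SomeTable; the values are made up.
some_table = pd.DataFrame(
    {
        "Column1": ["a", "a", "b", "b"],
        "Column2": ["x", "x", "y", "z"],
        "Column3": [1.0, 3.0, 5.0, 7.0],
        "Column4": [10, 20, 30, 40],
    }
)

# Group by two columns, then take the mean of Column3 and the sum of Column4.
# as_index=False keeps the group keys as ordinary columns, as in the SQL result.
result = some_table.groupby(["Column1", "Column2"], as_index=False).agg(
    {"Column3": "mean", "Column4": "sum"}
)
print(result)
```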

    See the cookbook for some advanced strategies.

    Splitting an object into groups#

    The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object (more on what the GroupBy object is later), you may do the following:

    In [1]: speeds = pd.DataFrame(
       ...:     [
       ...:         ("bird", "Falconiformes", 389.0),
       ...:         ("bird", "Psittaciformes", 24.0),
       ...:         ("mammal", "Carnivora", 80.2),
       ...:         ("mammal", "Primates", np.nan),
       ...:         ("mammal", "Carnivora", 58),
       ...:     ],
       ...:     index=["falcon", "parrot", "lion", "monkey", "leopard"],
       ...:     columns=("class", "order", "max_speed"),
       ...: )
    In [2]: speeds
    Out[2]: 
              class           order  max_speed
    falcon     bird   Falconiformes      389.0
    parrot     bird  Psittaciformes       24.0
    lion     mammal       Carnivora       80.2
    monkey   mammal        Primates        NaN
    leopard  mammal       Carnivora       58.0
    In [3]: grouped = speeds.groupby("class")
    In [4]: grouped = speeds.groupby(["class", "order"])
    

    The mapping can be specified many different ways:

  • A Python function, to be called on each of the index labels.

  • A list or NumPy array of the same length as the index.

  • A dict or Series, providing a label -> group name mapping.

  • For DataFrame objects, a string indicating either a column name or an index level name to be used to group.

  • A list of any of the above things.

    Collectively we refer to the grouping objects as the keys. For example, consider the following DataFrame:

    A string passed to groupby may refer to either a column or an index level. If a string matches both a column name and an index level name, a ValueError will be raised.

    In [5]: df = pd.DataFrame(
       ...:     {
       ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
       ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
       ...:         "C": np.random.randn(8),
       ...:         "D": np.random.randn(8),
       ...:     }
       ...: )
    In [6]: df
    Out[6]: 
         A      B         C         D
    0  foo    one  0.469112 -0.861849
    1  bar    one -0.282863 -2.104569
    2  foo    two -1.509059 -0.494929
    3  bar  three -1.135632  1.071804
    4  foo    two  1.212112  0.721555
    5  bar    two -0.173215 -0.706771
    6  foo    one  0.119209 -1.039575
    7  foo  three -1.044236  0.271860
    

    On a DataFrame, we obtain a GroupBy object by calling groupby(). This method returns a pandas.api.typing.DataFrameGroupBy instance. We could naturally group by either the A or B columns, or both:

    In [7]: grouped = df.groupby("A")
    In [8]: grouped = df.groupby("B")
    In [9]: grouped = df.groupby(["A", "B"])
    

    df.groupby('A') is just syntactic sugar for df.groupby(df['A']).
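The different key types listed earlier can be sketched on a small Series. The fruit labels and the one-letter group names are made up for this example; each variant is expected to produce the same two groups. The last two lines illustrate the equivalence between grouping by a column name and grouping by the column itself.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["apple", "avocado", "banana", "blueberry"])

# A function, called on each index label.
by_function = s.groupby(lambda label: label[0]).sum()

# A list with the same length as the index.
by_list = s.groupby(["a", "a", "b", "b"]).sum()

# A dict providing a label -> group name mapping.
by_dict = s.groupby(
    {"apple": "a", "avocado": "a", "banana": "b", "blueberry": "b"}
).sum()

# A Series providing the same mapping.
by_series = s.groupby(pd.Series(["a", "a", "b", "b"], index=s.index)).sum()

# For a DataFrame, a column name and the column itself give the same groups.
frame = pd.DataFrame({"A": ["x", "y", "x"], "C": [1, 2, 3]})
by_name = frame.groupby("A")["C"].sum()
by_column = frame.groupby(frame["A"])["C"].sum()
```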

    If we set a MultiIndex built from columns A and B, we can group by all the index levels except the one we specify:

    In [10]: df2 = df.set_index(["A", "B"])
    In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))
    In [12]: grouped.sum()
    Out[12]: 
                C         D
    bar -1.591710 -1.739537
    foo -0.752861 -1.402938
    

    The above GroupBy will split the DataFrame on its index (rows). To split by columns, first do a transpose:

    In [13]: def get_letter_type(letter):
       ....:     if letter.lower() in 'aeiou':
       ....:         return 'vowel'
       ....:     else:
       ....:         return 'consonant'
       ....: 
    In [14]: grouped = df.T.groupby(get_letter_type)
    

    pandas Index objects support duplicate values. If a non-unique index is used as the group key in a groupby operation, all values for the same index value will be considered to be in one group and thus the output of aggregation functions will only contain unique index values:

    In [15]: index = [1, 2, 3, 1, 2, 3]
    In [16]: s = pd.Series([1, 2, 3, 10, 20, 30], index=index)
    In [17]: s
    Out[17]: 
    1     1
    2     2
    3     3
    1    10
    2    20
    3    30
    dtype: int64
    In [18]: grouped = s.groupby(level=0)
    In [19]: grouped.first()
    Out[19]: 
    1    1
    2    2
    3    3
    dtype: int64
    In [20]: grouped.last()
    Out[20]: 
    1    10
    2    20
    3    30
    dtype: int64
    In [21]: grouped.sum()
    Out[21]: 
    1    11
    2    22
    3    33
    dtype: int64
    

    Note that no splitting occurs until it’s needed. Creating the GroupBy object only verifies that you’ve passed a valid mapping.

    Many kinds of complicated data manipulations can be expressed in terms of GroupBy operations (though it can’t be guaranteed to be the most efficient implementation). You can get quite creative with the label mapping functions.

    GroupBy sorting#

    By default, the group keys are sorted during the groupby operation. You may, however, pass sort=False for potential speedups. With sort=False, the order among group keys follows the order in which the keys appear in the original DataFrame:

    In [22]: df2 = pd.DataFrame({"X": ["B", "B", "A", "A"], "Y": [1, 2, 3, 4]})
    In [23]: df2.groupby(["X"]).sum()
    Out[23]: 
       Y
    X   
    A  7
    B  3
    In [24]: df2.groupby(["X"], sort=False).sum()
    Out[24]: 
       Y
    X   
    B  3
    A  7
    

    Note that groupby will preserve the order in which observations are sorted within each group. For example, the groups created by groupby() below are in the order they appeared in the original DataFrame:

    In [25]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})
    In [26]: df3.groupby("X").get_group("A")
    Out[26]: 
       X  Y
    0  A  1
    2  A  3
    In [27]: df3.groupby(["X"]).get_group(("B",))
    Out[27]: 
       X  Y
    1  B  4
    3  B  2
    

    GroupBy dropna#

    By default NA values are excluded from group keys during the groupby operation. However, in case you want to include NA values in group keys, you could pass dropna=False to achieve it.

    In [28]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
    In [29]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
    In [30]: df_dropna
    Out[30]: 
       a    b  c
    0  1  2.0  3
    1  1  NaN  4
    2  2  1.0  3
    3  1  2.0  2
    
    # Default ``dropna`` is set to True, which will exclude NaNs in keys
    In [31]: df_dropna.groupby(by=["b"], dropna=True).sum()
    Out[31]: 
         a  c
    b        
    1.0  2  3
    2.0  2  5
    # In order to allow NaN in keys, set ``dropna`` to False
    In [32]: df_dropna.groupby(by=["b"], dropna=False).sum()
    Out[32]: 
         a  c
    b        
    1.0  2  3
    2.0  2  5
    NaN  1  4
    

    The default setting of the dropna argument is True, which means NA values are not included in group keys.

    GroupBy object attributes#

    The groups attribute is a dictionary whose keys are the computed unique groups and corresponding values are the axis labels belonging to each group. In the above example we have:

    In [33]: df.groupby("A").groups
    Out[33]: {'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}
    In [34]: df.T.groupby(get_letter_type).groups
    Out[34]: {'consonant': ['B', 'C', 'D'], 'vowel': ['A']}
    

    Calling the standard Python len function on the GroupBy object returns the number of groups, which is the same as the length of the groups dictionary:

    In [35]: grouped = df.groupby(["A", "B"])
    In [36]: grouped.groups
    Out[36]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}
    In [37]: len(grouped)
    Out[37]: 6
    

    GroupBy will tab complete column names, GroupBy operations, and other attributes:

    In [38]: n = 10
    In [39]: weight = np.random.normal(166, 20, size=n)
    In [40]: height = np.random.normal(60, 10, size=n)
    In [41]: time = pd.date_range("1/1/2000", periods=n)
    In [42]: gender = np.random.choice(["male", "female"], size=n)
    In [43]: df = pd.DataFrame(
       ....:     {"height": height, "weight": weight, "gender": gender}, index=time
       ....: )
       ....: 
    In [44]: df
    Out[44]: 
                   height      weight  gender
    2000-01-01  42.849980  157.500553    male
    2000-01-02  49.607315  177.340407    male
    2000-01-03  56.293531  171.524640    male
    2000-01-04  48.421077  144.251986  female
    2000-01-05  46.556882  152.526206    male
    2000-01-06  68.448851  168.272968  female
    2000-01-07  70.757698  136.431469    male
    2000-01-08  58.909500  176.499753  female
    2000-01-09  76.435631  174.094104  female
    2000-01-10  45.306120  177.540920    male
    In [45]: gb = df.groupby("gender")
    
    In [46]: gb.<TAB>  # noqa: E225, E999
    gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
    gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
    gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight
    

    GroupBy with MultiIndex#

    With hierarchically-indexed data, it’s quite natural to group by one of the levels of the hierarchy.

    Let’s create a Series with a two-level MultiIndex.

    In [47]: arrays = [
       ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
       ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
       ....: ]
       ....: 
    In [48]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
    In [49]: s = pd.Series(np.random.randn(8), index=index)
    In [50]: s
    Out[50]: 
    first  second
    bar    one      -0.919854
           two      -0.042379
    baz    one       1.247642
           two      -0.009920
    foo    one       0.290213
           two       0.495767
    qux    one       0.362949
           two       1.548106
    dtype: float64
    

    We can then group by one of the levels in s.

    In [51]: grouped = s.groupby(level=0)
    In [52]: grouped.sum()
    Out[52]: 
    first
    bar   -0.962232
    baz    1.237723
    foo    0.785980
    qux    1.911055
    dtype: float64
    

    If the MultiIndex has names specified, these can be passed instead of the level number:

    In [53]: s.groupby(level="second").sum()
    Out[53]: 
    second
    one    0.980950
    two    1.991575
    dtype: float64
    

    Grouping with multiple levels is supported.

    In [54]: arrays = [
       ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
       ....:     ["doo", "doo", "bee", "bee", "bop", "bop", "bop", "bop"],
       ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
       ....: ]
       ....: 
    In [55]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second", "third"])
    In [56]: s = pd.Series(np.random.randn(8), index=index)
    In [57]: s
    Out[57]: 
    first  second  third
    bar    doo     one     -1.131345
                   two     -0.089329
    baz    bee     one      0.337863
                   two     -0.945867
    foo    bop     one     -0.932132
                   two      1.956030
    qux    bop     one      0.017587
                   two     -0.016692
    dtype: float64
    In [58]: s.groupby(level=["first", "second"]).sum()
    Out[58]: 
    first  second
    bar    doo      -1.220674
    baz    bee      -0.608004
    foo    bop       1.023898
    qux    bop       0.000895
    dtype: float64
    

    Index level names may be supplied as keys.

    In [59]: s.groupby(["first", "second"]).sum()
    Out[59]: 
    first  second
    bar    doo      -1.220674
    baz    bee      -0.608004
    foo    bop       1.023898
    qux    bop       0.000895
    dtype: float64
    

    More on the sum function and aggregation later.

    Grouping DataFrame with Index levels and columns#

    A DataFrame may be grouped by a combination of columns and index levels. You can specify both column and index names, or use a Grouper.

    Let’s first create a DataFrame with a MultiIndex:

    In [60]: arrays = [
       ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
       ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
       ....: ]
       ....: 
    In [61]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
    In [62]: df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index)
    In [63]: df
    Out[63]: 
                  A  B
    first second      
    bar   one     1  0
          two     1  1
    baz   one     1  2
          two     1  3
    foo   one     2  4
          two     2  5
    qux   one     3  6
          two     3  7
    

    Then we group df by the second index level and the A column.

    In [64]: df.groupby([pd.Grouper(level=1), "A"]).sum()
    Out[64]: 
              B
    second A   
    one    1  2
           2  4
           3  6
    two    1  4
           2  5
           3  7
    

    Index levels may also be specified by name.

    In [65]: df.groupby([pd.Grouper(level="second"), "A"]).sum()
    Out[65]: 
              B
    second A   
    one    1  2
           2  4
           3  6
    two    1  4
           2  5
           3  7
    

    Index level names may be specified as keys directly to groupby.

    In [66]: df.groupby(["second", "A"]).sum()
    Out[66]: 
              B
    second A   
    one    1  2
           2  4
           3  6
    two    1  4
           2  5
           3  7
    

    DataFrame column selection in GroupBy#

    Once you have created the GroupBy object from a DataFrame, you might want to do something different for each of the columns. Thus, by using [] on the GroupBy object in a similar way as the one used to get a column from a DataFrame, you can do:

    In [67]: df = pd.DataFrame(
       ....:     {
       ....:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
       ....:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
       ....:         "C": np.random.randn(8),
       ....:         "D": np.random.randn(8),
       ....:     }
       ....: )
       ....: 
    In [68]: df
    Out[68]: 
         A      B         C         D
    0  foo    one -0.575247  1.346061
    1  bar    one  0.254161  1.511763
    2  foo    two -1.143704  1.627081
    3  bar  three  0.215897 -0.990582
    4  foo    two  1.193555 -0.441652
    5  bar    two -0.077118  1.211526
    6  foo    one -0.408530  0.268520
    7  foo  three -0.862495  0.024580
    In [69]: grouped = df.groupby(["A"])
    In [70]: grouped_C = grouped["C"]
    In [71]: grouped_D = grouped["D"]
    

    This is mainly syntactic sugar for the alternative, which is much more verbose:

    In [72]: df["C"].groupby(df["A"])
    Out[72]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7ff2cef1c730>
    

    Additionally, this method avoids recomputing the internal grouping information derived from the passed key.

    You can also include the grouping columns if you want to operate on them.

    In [73]: grouped[["A", "B"]].sum()
    Out[73]: 
                       A                  B
    bar        barbarbar        onethreetwo
    foo  foofoofoofoofoo  onetwotwoonethree
    

    Iterating through groups#

    With the GroupBy object in hand, iterating through the grouped data is very natural and functions similarly to itertools.groupby():

    In [74]: grouped = df.groupby('A')
    In [75]: for name, group in grouped:
       ....:     print(name)
       ....:     print(group)
       ....: 
    bar
         A      B         C         D
    1  bar    one  0.254161  1.511763
    3  bar  three  0.215897 -0.990582
    5  bar    two -0.077118  1.211526
    foo
         A      B         C         D
    0  foo    one -0.575247  1.346061
    2  foo    two -1.143704  1.627081
    4  foo    two  1.193555 -0.441652
    6  foo    one -0.408530  0.268520
    7  foo  three -0.862495  0.024580
    

    In the case of grouping by multiple keys, the group name will be a tuple:

    In [76]: for name, group in df.groupby(['A', 'B']):
       ....:     print(name)
       ....:     print(group)
       ....: 
    ('bar', 'one')
         A    B         C         D
    1  bar  one  0.254161  1.511763
    ('bar', 'three')
         A      B         C         D
    3  bar  three  0.215897 -0.990582
    ('bar', 'two')
         A    B         C         D
    5  bar  two -0.077118  1.211526
    ('foo', 'one')
         A    B         C         D
    0  foo  one -0.575247  1.346061
    6  foo  one -0.408530  0.268520
    ('foo', 'three')
         A      B         C        D
    7  foo  three -0.862495  0.02458
    ('foo', 'two')
         A    B         C         D
    2  foo  two -1.143704  1.627081
    4  foo  two  1.193555 -0.441652
    

    See Iterating through groups.

    Selecting a group#

    A single group can be selected using DataFrameGroupBy.get_group():

    In [77]: grouped.get_group("bar")
    Out[77]: 
         A      B         C         D
    1  bar    one  0.254161  1.511763
    3  bar  three  0.215897 -0.990582
    5  bar    two -0.077118  1.211526
    

    Or for an object grouped on multiple columns:

    In [78]: df.groupby(["A", "B"]).get_group(("bar", "one"))
    Out[78]: 
         A    B         C         D
    1  bar  one  0.254161  1.511763
    

    Aggregation#

    An aggregation is a GroupBy operation that reduces the dimension of the grouping object. The result of an aggregation is, or at least is treated as, a scalar value for each column in a group. An example is producing the sum of each column in a group of values.

    In [79]: animals = pd.DataFrame(
       ....:     {
       ....:         "kind": ["cat", "dog", "cat", "dog"],
       ....:         "height": [9.1, 6.0, 9.5, 34.0],
       ....:         "weight": [7.9, 7.5, 9.9, 198.0],
       ....:     }
       ....: )
       ....: 
    In [80]: animals
    Out[80]: 
      kind  height  weight
    0  cat     9.1     7.9
    1  dog     6.0     7.5
    2  cat     9.5     9.9
    3  dog    34.0   198.0
    In [81]: animals.groupby("kind").sum()
    Out[81]: 
          height  weight
    cat     18.6    17.8
    dog     40.0   205.5
    

    In the result, the keys of the groups appear in the index by default. They can be instead included in the columns by passing as_index=False.

    In [82]: animals.groupby("kind", as_index=False).sum()
    Out[82]: 
      kind  height  weight
    0  cat    18.6    17.8
    1  dog    40.0   205.5
    

    Built-in aggregation methods#

    Many common aggregations are built-in to GroupBy objects as methods. Of the methods listed below, those with a * do not have an efficient, GroupBy-specific, implementation.

    Another aggregation example is to compute the size of each group. This is included in GroupBy as the size method. It returns a Series whose index consists of the group names and the values are the sizes of each group.

    In [85]: grouped = df.groupby(["A", "B"])
    In [86]: grouped.size()
    Out[86]: 
    A    B    
    bar  one      1
         three    1
         two      1
    foo  one      2
         three    1
         two      2
    dtype: int64
    

    While the DataFrameGroupBy.describe() method is not itself a reducer, it can be used to conveniently produce a collection of summary statistics about each of the groups.

    In [87]: grouped.describe()
    Out[87]: 
                  C                      ...         D                    
              count      mean       std  ...       50%       75%       max
    A   B                                ...                              
    bar one     1.0  0.254161       NaN  ...  1.511763  1.511763  1.511763
        three   1.0  0.215897       NaN  ... -0.990582 -0.990582 -0.990582
        two     1.0 -0.077118       NaN  ...  1.211526  1.211526  1.211526
    foo one     2.0 -0.491888  0.117887  ...  0.807291  1.076676  1.346061
        three   1.0 -0.862495       NaN  ...  0.024580  0.024580  0.024580
        two     2.0  0.024925  1.652692  ...  0.592714  1.109898  1.627081
    [6 rows x 16 columns]
    

    Another aggregation example is to compute the number of unique values of each group. This is similar to the DataFrameGroupBy.value_counts() function, except that it only counts the number of unique values.

    In [88]: ll = [['foo', 1], ['foo', 2], ['foo', 2], ['bar', 1], ['bar', 1]]
    In [89]: df4 = pd.DataFrame(ll, columns=["A", "B"])
    In [90]: df4
    Out[90]: 
         A  B
    0  foo  1
    1  foo  2
    2  foo  2
    3  bar  1
    4  bar  1
    In [91]: df4.groupby("A")["B"].nunique()
    Out[91]: 
    bar    1
    foo    2
    Name: B, dtype: int64
    

    Aggregation functions will not return the groups that you are aggregating over as named columns when as_index=True, the default. The grouped columns will be the indices of the returned object.

    Passing as_index=False will return the groups that you are aggregating over as named columns, regardless of whether they are named indices or columns in the inputs.

    The aggregate() method#

    The aggregate() method can accept many different types of inputs. This section details using string aliases for various GroupBy methods; other inputs are detailed in the sections below.

    Any reduction method that pandas implements can be passed as a string to aggregate(). Users are encouraged to use the shorthand, agg. It will operate as if the corresponding method was called.

    In [92]: grouped = df.groupby("A")
    In [93]: grouped[["C", "D"]].aggregate("sum")
    Out[93]: 
                C         D
    bar  0.392940  1.732707
    foo -1.796421  2.824590
    In [94]: grouped = df.groupby(["A", "B"])
    In [95]: grouped.agg("sum")
    Out[95]: 
                      C         D
    A   B                        
    bar one    0.254161  1.511763
        three  0.215897 -0.990582
        two   -0.077118  1.211526
    foo one   -0.983776  1.614581
        three -0.862495  0.024580
        two    0.049851  1.185429
    

    The result of the aggregation will have the group names as the new index. In the case of multiple keys, the result is a MultiIndex by default. As mentioned above, this can be changed by using the as_index option:

    In [96]: grouped = df.groupby(["A", "B"], as_index=False)
    In [97]: grouped.agg("sum")
    Out[97]: 
         A      B         C         D
    0  bar    one  0.254161  1.511763
    1  bar  three  0.215897 -0.990582
    2  bar    two -0.077118  1.211526
    3  foo    one -0.983776  1.614581
    4  foo  three -0.862495  0.024580
    5  foo    two  0.049851  1.185429
    In [98]: df.groupby("A", as_index=False)[["C", "D"]].agg("sum")
    Out[98]: 
         A         C         D
    0  bar  0.392940  1.732707
    1  foo -1.796421  2.824590
    

    Note that you could use DataFrame.reset_index() to achieve the same result, since the group keys are stored in the resulting MultiIndex, although this will make an extra copy.

    In [99]: df.groupby(["A", "B"]).agg("sum").reset_index()
    Out[99]: 
         A      B         C         D
    0  bar    one  0.254161  1.511763
    1  bar  three  0.215897 -0.990582
    2  bar    two -0.077118  1.211526
    3  foo    one -0.983776  1.614581
    4  foo  three -0.862495  0.024580
    5  foo    two  0.049851  1.185429
    

    Aggregation with User-Defined Functions#

    Users can also provide their own User-Defined Functions (UDFs) for custom aggregations.

    Warning

    When aggregating with a UDF, the UDF should not mutate the provided Series. See Mutating with User Defined Function (UDF) methods for more information.

    Aggregating with a UDF is often less performant than using the pandas built-in methods on GroupBy. Consider breaking up a complex operation into a chain of operations that utilize the built-in methods.

    In [100]: animals
    Out[100]: 
      kind  height  weight
    0  cat     9.1     7.9
    1  dog     6.0     7.5
    2  cat     9.5     9.9
    3  dog    34.0   198.0
    In [101]: animals.groupby("kind")[["height"]].agg(lambda x: set(x))
    Out[101]: 
               height
    cat    {9.1, 9.5}
    dog   {34.0, 6.0}
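As a hedged illustration of the advice to break a UDF into built-in steps, the sketch below computes a per-group coefficient of variation (standard deviation divided by mean) both ways. The statistic is chosen only as an example, and the frame is a copy of the animals data shown above.

```python
import pandas as pd

animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)
grouped = animals.groupby("kind")["height"]

# A single UDF, called once per group.
cv_udf = grouped.agg(lambda x: x.std() / x.mean())

# The same statistic as a chain of built-in GroupBy methods.
cv_builtin = grouped.std() / grouped.mean()

print(cv_udf)
print(cv_builtin)
```

The two results should agree up to floating-point rounding; the second form relies only on the built-in reductions.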
    

    The resulting dtype will reflect that of the aggregating function. If the results from different groups have different dtypes, then a common dtype will be determined in the same way as DataFrame construction.

    In [102]: animals.groupby("kind")[["height"]].agg(lambda x: x.astype(int).sum())
    Out[102]: 
          height
    cat       18
    dog       40
    

    Applying multiple functions at once#

    On a grouped Series, you can pass a list or dict of functions to SeriesGroupBy.agg(), outputting a DataFrame:

    In [103]: grouped = df.groupby("A")
    In [104]: grouped["C"].agg(["sum", "mean", "std"])
    Out[104]: 
              sum      mean       std
    bar  0.392940  0.130980  0.181231
    foo -1.796421 -0.359284  0.912265
    

    On a grouped DataFrame, you can pass a list of functions to DataFrameGroupBy.agg() to aggregate each column, which produces an aggregated result with a hierarchical column index:

    In [105]: grouped[["C", "D"]].agg(["sum", "mean", "std"])
    Out[105]: 
                C                             D                    
              sum      mean       std       sum      mean       std
    bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
    foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785
    

    The resulting aggregations are named after the functions themselves. If you need to rename, then you can add in a chained operation for a Series like this:

    In [106]: (
       .....:     grouped["C"]
       .....:     .agg(["sum", "mean", "std"])
       .....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
       .....: )
       .....: 
    Out[106]: 
              foo       bar       baz
    bar  0.392940  0.130980  0.181231
    foo -1.796421 -0.359284  0.912265
    

    For a grouped DataFrame, you can rename in a similar manner:

    In [107]: (
       .....:     grouped[["C", "D"]].agg(["sum", "mean", "std"]).rename(
       .....:         columns={"sum": "foo", "mean": "bar", "std": "baz"}
       .....:     )
       .....: )
       .....: 
    Out[107]: 
                C                             D                    
              foo       bar       baz       foo       bar       baz
    bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
    foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785
    

    In general, the output column names should be unique, but pandas will allow you to apply the same function (or two functions with the same name) to the same column.

    In [108]: grouped["C"].agg(["sum", "sum"])
    Out[108]: 
              sum       sum
    bar  0.392940  0.392940
    foo -1.796421 -1.796421
    

    pandas also allows you to provide multiple lambdas. In this case, pandas will mangle the name of the (nameless) lambda functions, appending _<i> to each subsequent lambda.

    In [109]: grouped["C"].agg([lambda x: x.max() - x.min(), lambda x: x.median() - x.mean()])
    Out[109]: 
         <lambda_0>  <lambda_1>
    bar    0.331279    0.084917
    foo    2.337259   -0.215962
    

    Named aggregation#

    To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in DataFrameGroupBy.agg() and SeriesGroupBy.agg(), known as “named aggregation”, where

  • The keywords are the output column names

  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. pandas provides the NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

    In [110]: animals
    Out[110]: 
      kind  height  weight
    0  cat     9.1     7.9
    1  dog     6.0     7.5
    2  cat     9.5     9.9
    3  dog    34.0   198.0
    In [111]: animals.groupby("kind").agg(
       .....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
       .....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
       .....:     average_weight=pd.NamedAgg(column="weight", aggfunc="mean"),
       .....: )
       .....: 
    Out[111]: 
          min_height  max_height  average_weight
    cat          9.1         9.5            8.90
    dog          6.0        34.0          102.75
    

    NamedAgg is just a namedtuple. Plain tuples are allowed as well.

    In [112]: animals.groupby("kind").agg(
       .....:     min_height=("height", "min"),
       .....:     max_height=("height", "max"),
       .....:     average_weight=("weight", "mean"),
       .....: )
       .....: 
    Out[112]: 
          min_height  max_height  average_weight
    cat          9.1         9.5            8.90
    dog          6.0        34.0          102.75
    

    If the column names you want are not valid Python identifiers (and so cannot be written as keyword arguments), construct a dictionary and unpack it as keyword arguments:

    In [113]: animals.groupby("kind").agg(
       .....:     **{
       .....:         "total weight": pd.NamedAgg(column="weight", aggfunc="sum")
       .....:     }
       .....: )
       .....: 
    Out[113]: 
          total weight
    cat           17.8
    dog          205.5
    

    When using named aggregation, additional keyword arguments are not passed through to the aggregation functions; only pairs of (column, aggfunc) should be passed as **kwargs. If your aggregation functions require additional arguments, apply them partially with functools.partial().
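A minimal sketch of the functools.partial approach follows. `count_above` and its `threshold` parameter are hypothetical names introduced only for this illustration, and the frame is a copy of the animals data used above.

```python
from functools import partial

import pandas as pd

animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)


def count_above(values, threshold):
    """Hypothetical helper: count the entries strictly greater than threshold."""
    return int((values > threshold).sum())


# The extra argument is bound ahead of time, because keyword arguments to agg
# are interpreted as output column names rather than passed to the function.
result = animals.groupby("kind").agg(
    n_tall=("height", partial(count_above, threshold=9.0)),
)
print(result)
```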

    Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.

    In [114]: animals.groupby("kind").height.agg(
       .....:     min_height="min",
       .....:     max_height="max",
       .....: )
       .....: 
    Out[114]: 
          min_height  max_height
    cat          9.1         9.5
    dog          6.0        34.0
    

    Applying different functions to DataFrame columns#

    By passing a dict to aggregate you can apply a different aggregation to the columns of a DataFrame:

    In [115]: grouped.agg({"C": "sum", "D": lambda x: np.std(x, ddof=1)})
    Out[115]: 
                C         D
    bar  0.392940  1.366330
    foo -1.796421  0.884785
    

    The function names can also be strings. In order for a string to be valid it must be implemented on GroupBy:

    In [116]: grouped.agg({"C": "sum", "D": "std"})
    Out[116]: 
                C         D
    bar  0.392940  1.366330
    foo -1.796421  0.884785
    

    Transformation#

    A transformation is a GroupBy operation whose result is indexed the same as the one being grouped. Common examples include cumsum() and diff().

    In [117]: speeds
    Out[117]: 
              class           order  max_speed
    falcon     bird   Falconiformes      389.0
    parrot     bird  Psittaciformes       24.0
    lion     mammal       Carnivora       80.2
    monkey   mammal        Primates        NaN
    leopard  mammal       Carnivora       58.0
    In [118]: grouped = speeds.groupby("class")["max_speed"]
    In [119]: grouped.cumsum()
    Out[119]: 
    falcon     389.0
    parrot     413.0
    lion        80.2
    monkey       NaN
    leopard    138.2
    Name: max_speed, dtype: float64
    In [120]: grouped.diff()
    Out[120]: 
    falcon       NaN
    parrot    -365.0
    lion         NaN
    monkey       NaN
    leopard      NaN
    Name: max_speed, dtype: float64
    

    Unlike aggregations, the groupings that are used to split the original object are not included in the result.

    Since transformations do not include the groupings that are used to split the original object, the arguments as_index and sort in DataFrame.groupby() and Series.groupby() have no effect.
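
    A small sketch of this behavior for the sort argument, assuming a reordered copy of the speeds data in which the group keys do not appear in sorted order. Only sort is exercised here; as_index is not demonstrated.

```python
import numpy as np
import pandas as pd

# Rows are deliberately ordered so that "mammal" appears before "bird".
speeds = pd.DataFrame(
    {
        "class": ["mammal", "bird", "mammal", "bird", "mammal"],
        "max_speed": [80.2, 389.0, np.nan, 24.0, 58.0],
    },
    index=["lion", "falcon", "monkey", "parrot", "leopard"],
)

sorted_keys = speeds.groupby("class", sort=True)["max_speed"].cumsum()
unsorted_keys = speeds.groupby("class", sort=False)["max_speed"].cumsum()

# Both results keep the row order of the original object.
print(sorted_keys.equals(unsorted_keys))
print(list(sorted_keys.index) == list(speeds.index))
```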

    A common use of a transformation is to add the result back into the original DataFrame.

    In [121]: result = speeds.copy()
    In [122]: result["cumsum"] = grouped.cumsum()
    In [123]: result["diff"] = grouped.diff()
    In [124]: result
    Out[124]: 
              class           order  max_speed  cumsum   diff
    falcon     bird   Falconiformes      389.0   389.0    NaN
    parrot     bird  Psittaciformes       24.0   413.0 -365.0
    lion     mammal       Carnivora       80.2    80.2    NaN
    monkey   mammal        Primates        NaN     NaN    NaN
    leopard  mammal       Carnivora       58.0   138.2    NaN
    

    Built-in transformation methods#

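    The following methods on GroupBy act as transformations: bfill(), cumcount(), cummax(), cummin(), cumprod(), cumsum(), diff(), ffill(), pct_change(), rank(), and shift().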

    In addition, passing any built-in aggregation method as a string to transform() (see the next section) will broadcast the result across the group, producing a transformed result. If the aggregation method has an efficient implementation, this will be performant as well.

    The transform() method#

    Similar to the aggregation method, the transform() method can accept string aliases to the built-in transformation methods in the previous section. It can also accept string aliases to the built-in aggregation methods. When an aggregation method is provided, the result will be broadcast across the group.

    In [125]: speeds
    Out[125]: 
              class           order  max_speed
    falcon     bird   Falconiformes      389.0
    parrot     bird  Psittaciformes       24.0
    lion     mammal       Carnivora       80.2
    monkey   mammal        Primates        NaN
    leopard  mammal       Carnivora       58.0
    In [126]: grouped = speeds.groupby("class")[["max_speed"]]
    In [127]: grouped.transform("cumsum")
    Out[127]: 
             max_speed
    falcon       389.0
    parrot       413.0
    lion          80.2
    monkey         NaN
    leopard      138.2
    In [128]: grouped.transform("sum")
    Out[128]: 
             max_speed
    falcon       413.0
    parrot       413.0
    lion         138.2
    monkey       138.2
    leopard      138.2
    

    In addition to string aliases, the transform() method can also accept User-Defined Functions (UDFs). The UDF must:

  • Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).

  • Operate column-by-column on the group chunk. The transform is applied to the first group chunk using chunk.apply.

  • Not perform in-place operations on the group chunk. Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results. See Mutating with User Defined Function (UDF) methods for more information.

  • (Optionally) operate on all columns of the entire group chunk at once. If this is supported, a fast path is used starting from the second chunk.

    Transforming by supplying transform with a UDF is often less performant than using the built-in methods on GroupBy. Consider breaking up a complex operation into a chain of operations that utilize the built-in methods.

    All of the examples in this section can be made more performant by calling built-in methods instead of using UDFs. See below for examples.
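
    To illustrate the trade-off, the sketch below fills missing values with the group mean in two ways: once with a UDF that returns a result the same size as the group chunk, and once with built-in methods only. It reuses the speeds data from above; the variable names filled_udf and filled_builtin are chosen for illustration.

```python
import numpy as np
import pandas as pd

speeds = pd.DataFrame(
    {
        "class": ["bird", "bird", "mammal", "mammal", "mammal"],
        "max_speed": [389.0, 24.0, 80.2, np.nan, 58.0],
    },
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
)
grouped = speeds.groupby("class")["max_speed"]

# UDF version: called once per group, returns a Series of the same size.
filled_udf = grouped.transform(lambda x: x.fillna(x.mean()))

# Built-in version: broadcast the group mean, then fill in one vectorized step.
filled_builtin = speeds["max_speed"].fillna(grouped.transform("mean"))

print(filled_builtin)
```

    Both versions should give the same values here; the built-in version avoids calling a Python function once per group.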

    Changed in version 2.0.0: When using .transform on a grouped DataFrame and the transformation function returns a DataFrame, pandas now aligns the result’s index with the input’s index. You can call .to_numpy() within the transformation function to avoid alignment.

    As with the aggregate() method, the resulting dtype will reflect that of the transformation function. If the results from different groups have different dtypes, then a common dtype will be determined in the same way as DataFrame construction.

    Suppose we wish to standardize the data within each group:

    In [129]: index = pd.date_range("10/1/1999", periods=1100)
    In [130]: ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
    In [131]: ts = ts.rolling(window=100, min_periods=100).mean().dropna()
    In [132]: ts.head()
    Out[132]: 
    2000-01-08    0.779333
    2000-01-09    0.778852
    2000-01-10    0.786476
    2000-01-11    0.782797
    2000-01-12    0.798110
    Freq: D, dtype: float64
    In [133]: ts.tail()
    Out[133]: 
    2002-09-30    0.660294
    2002-10-01    0.631095
    2002-10-02    0.673601
    2002-10-03    0.709213
    2002-10-04    0.719369
    Freq: D, dtype: float64
    In [134]: transformed = ts.groupby(lambda x: x.year).transform(
       .....:     lambda x: (x - x.mean()) / x.std()
       .....: )
       .....: 
    

    We would expect the result to now have mean 0 and standard deviation 1 within each group (up to floating-point error), which we can easily check:

    # Original Data
    In [135]: grouped = ts.groupby(lambda x: x.year)
    In [136]: grouped.mean()
    Out[136]: 
    2000    0.442441
    2001    0.526246
    2002    0.459365
    dtype: float64
    In [137]: grouped.std()
    Out[137]: 
    2000    0.131752
    2001    0.210945
    2002    0.128753
    dtype: float64
    # Transformed Data
    In [138]: grouped_trans = transformed.groupby(lambda x: x.year)
    In [139]: grouped_trans.mean()
    Out[139]: 
    2000   -4.870756e-16
    2001   -1.545187e-16
    2002    4.136282e-16
    dtype: float64
    In [140]: grouped_trans.std()
    Out[140]: 
    2000    1.0
    2001    1.0
    2002    1.0
    dtype: float64
    

    We can also visually compare the original and transformed data sets.

    In [141]: compare = pd.DataFrame({"Original": ts, "Transformed": transformed})
    In [142]: compare.plot()
    Out[142]: <Axes: >
    

    Transformation functions that have lower dimension outputs are broadcast to match the shape of the input array.
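
    For instance, a UDF that reduces each group to a single scalar has that scalar repeated for every row of the group. The sketch below computes the range of values within each year, using synthetic data generated with a fixed seed for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series spanning two calendar years (illustrative data).
rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=730)
ts = pd.Series(rng.normal(0.5, 2, len(index)), index=index)

# The lambda returns one scalar per group; transform broadcasts it so the
# result has the same index as ts.
group_range = ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

print(group_range.head())
```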

     