pandas: Advanced groupby(), apply() and MultiIndex

Series.apply(): apply a function call across a vector

The function is called once with each value in the Series.


Sometimes our computation is more complex than simple math, or we need to apply a function to each element. We can use apply():

import pandas as pd

df = pd.DataFrame( {'a': [1, 2, 3, 4],
                    'b': [1.0, 1.5, 2.0, 2.5],
                    'c': ['a', 'b', 'c', 'd'] }, index=['r1', 'r2', 'r3', 'r4'] )

print(df)

                  #     a    b  c
                  # r1  1  1.0  a
                  # r2  2  1.5  b
                  # r3  3  2.0  c
                  # r4  4  2.5  d


df['d'] = df['c'].apply(str.upper)

print(df)
                  #     a    b  c  d
                  # r1  1  1.0  a  A
                  # r2  2  1.5  b  B
                  # r3  3  2.0  c  C
                  # r4  4  2.5  d  D
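For string operations like this, pandas also provides the vectorized .str accessor, which does the same work without a Python-level function call per element -- a small equivalent sketch:

```python
import pandas as pd

# the same uppercase transform via the vectorized .str accessor
df = pd.DataFrame({'c': ['a', 'b', 'c', 'd']})
df['d'] = df['c'].str.upper()
print(df['d'].tolist())    # ['A', 'B', 'C', 'D']
```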




apply() with custom function or lambda

We use a custom named function or a lambda with apply():


print(df)
                  #     a    b  c  d
                  # r1  1  1.0  a  A
                  # r2  2  1.5  b  B
                  # r3  3  2.0  c  C
                  # r4  4  2.5  d  D


df['e'] = df['a'].apply(lambda x: '$' + str(x * 1000) )

print(df)
                  #     a    b  c  d      e
                  # r1  1  1.0  a  A  $1000
                  # r2  2  1.5  b  B  $2000
                  # r3  3  2.0  c  C  $3000
                  # r4  4  2.5  d  D  $4000

See below for an explanation of lambda.

Review: lambda expressions

A lambda describes a function in shorthand.


Compare these two functions, both of which add/concatenate their arguments:

def addthese(x, y):
    return x + y

addthese2 = lambda x, y:  x + y

print(addthese(5, 9))        # 14
print(addthese2(5, 9))       # 14

The function definition and the lambda expression are equivalent -- both produce a function object with the same behavior.
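A lambda is most useful written inline, right where a function argument is expected -- for example as the key= function to sorted(), a pattern we will see again with groupby():

```python
# sort words by length using an inline lambda as the key= function
words = ['pandas', 'numpy', 'pd', 'matplotlib']
print(sorted(words, key=lambda w: len(w)))    # ['pd', 'numpy', 'pandas', 'matplotlib']
```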





Advanced groupby(): multi-column aggregation

Calculating a sum or count based on values in 2 or more columns.


To aggregate by the combined values of two or more columns, simply pass a list of column names to groupby() -- the result is called a "multi-column aggregation". (This example uses the company/region/revenue DataFrame constructed in the next section; numeric_only=True keeps the non-numeric 'company' column out of the sum.)


print(df.groupby(['region', 'revtype']).sum(numeric_only=True))

                  #                   revenue
                  # region revtype
                  # NE     retail          12
                  #        wholesale       16
                  # NW     retail          18
                  #        wholesale       15
                  # SW     retail           3
                  #        wholesale       10

Note that the index now has 2 levels (you can tell because the index labels are 'recessed' beneath the column header row). This is a MultiIndex, or hierarchical index. In the above example NE stands over both retail and wholesale in the first 2 rows -- we should read these as NE-retail and NE-wholesale.
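Once a result carries a MultiIndex, a single value can be read with a tuple of labels. A minimal sketch, using a small DataFrame with the same values as the summed output above:

```python
import pandas as pd

# same values as the summed output shown above
df = pd.DataFrame({'region':  ['NE', 'NE', 'NW', 'NW', 'SW', 'SW'],
                   'revtype': ['retail', 'wholesale'] * 3,
                   'revenue': [12, 16, 18, 15, 3, 10]})

summed = df.groupby(['region', 'revtype']).sum()

# a tuple addresses one row of the 2-level index
print(summed.loc[('NE', 'wholesale'), 'revenue'])    # 16
```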





grouping functions: use a custom summary function

Like passing a function to sorted(), we can pass a function to df.groupby().


df = pd.DataFrame( { 'company': [ 'Alpha', 'Alpha', 'Alpha',
                                  'Beta', 'Beta', 'Beta',
                                  'Gamma', 'Gamma', 'Gamma'],
                     'region':  [ 'NE', 'NW', 'SW', 'NW', 'SW',
                                  'NE', 'NE', 'SW', 'NW'],
                     'revenue': [ 10, 9, 2, 15, 8, 2, 16, 3, 9],
                     'revtype': [ 'retail', 'retail', 'wholesale',
                                  'wholesale', 'wholesale', 'retail',
                                  'wholesale', 'retail', 'retail'     ] } )

print(df)

                  #   company region  revenue    revtype
                  # 0   Alpha     NE       10     retail
                  # 1   Alpha     NW        9     retail
                  # 2   Alpha     SW        2  wholesale
                  # 3    Beta     NW       15  wholesale
                  # 4    Beta     SW        8  wholesale
                  # 5    Beta     NE        2     retail
                  # 6   Gamma     NE       16  wholesale
                  # 7   Gamma     SW        3     retail
                  # 8   Gamma     NW        9     retail

groupby() functions using apply()

We can design our own custom summary functions -- we simply use apply() and pass a function (you might remember similarly passing a function to the key= argument of sorted()). Here is the equivalent of the sum() function, written as a custom function:


def get_sum(df_slice):
    return sum(df_slice['revenue'])

print(df.groupby('region').apply(get_sum))  # custom function: same as groupby('region').sum()

                  # region
                  # NE    28
                  # NW    33
                  # SW    13
                  # dtype: int64

As with sorted(), pandas calls our function multiple times, once for each group. The argument Python passes to the function is a DataFrame slice containing just the rows of a single group -- in this case, a specific region (i.e., it is called once with a slice of NE rows, once with NW rows, etc.). The function should return the desired value for that slice -- here, the sum of the revenue column (as mentioned, this simply illustrates a function that does the same work as the built-in .sum()). (For a better view of what is happening, print df_slice inside the function -- you will see the values in each slice printed.)

Here is a custom function that returns the median ("middle value") for each region:


def get_median(df):
    listvals = sorted(list(df['revenue']))
    lenvals = len(listvals)
    midval = listvals[ int(lenvals / 2) ]
    return midval

print(df.groupby('region').apply(get_median))

                  # region
                  # NE    10
                  # NW     9
                  # SW     3
                  # dtype: int64
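Note that the midpoint approach above matches the true median only because each region has an odd number of rows; the built-in .median() averages the two middle values when a group has an even count. A quick check against the built-in, using just the region and revenue columns:

```python
import pandas as pd

df = pd.DataFrame({'region':  ['NE', 'NW', 'SW', 'NW', 'SW', 'NE', 'NE', 'SW', 'NW'],
                   'revenue': [10, 9, 2, 15, 8, 2, 16, 3, 9]})

# built-in median per region; note it returns floats
print(df.groupby('region')['revenue'].median())

# region
# NE    10.0
# NW     9.0
# SW     3.0
# Name: revenue, dtype: float64
```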




grouping functions: use a function to identify a group

Standard aggregations group rows based on a column value ('NW', 'SW', etc.) or a combination of column values. If more work is needed to identify a group, we can supply a custom function for that operation as well. Perhaps we'd like to group rows by whether or not they achieved a revenue target of 10 -- that is, whether the revenue value has two digits rather than one. Our function will simply return the number of digits in the value's string representation. We can process this column value (or even combine other column values) by passing the function to groupby():


def bydecplace(idx):
    row = df.loc[idx]                 # a Series with the row values for this index
    return(len(str(row['revenue'])))  # '2' if 10; '1' if 9

print(df.groupby(bydecplace).sum(numeric_only=True))    # numeric_only: sum just the revenue column
                  #      revenue
                  #   1       33
                  #   2       41

The value passed to the function is the index of a row; we can thus use the .loc attribute with the index value to access the row. This function isolates the revenue within the row and returns the length of its string representation -- the number of digits, so two-digit revenues land in group 2 and single digits in group 1.

using lambdas as groupby() or grouping functions

Of course any of these simple functions can be rewritten as a lambda (and in cases like this one arguably should be, since the function references the DataFrame directly, and we should prefer not to refer to outside variables in a standard named function):


def bydecplace(idx):
    row = df.loc[idx]                 # a Series with the row values for this index
    return(len(str(row['revenue'])))  # '2' if 10; '1' if 9

print(df.groupby(lambda idx:  len(str(df.loc[idx]['revenue']))).sum(numeric_only=True))

                  #      revenue
                  #   1       33
                  #   2       41
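A grouping like "did the row reach the revenue target?" can also be expressed by passing a boolean Series to groupby(), avoiding the index-based function altogether -- a sketch using just the revenue column:

```python
import pandas as pd

df = pd.DataFrame({'revenue': [10, 9, 2, 15, 8, 2, 16, 3, 9]})

# True group: rows with revenue >= 10; False group: the rest
hit_target = df.groupby(df['revenue'] >= 10).sum()
print(hit_target)
#          revenue
# revenue
# False         33
# True          41
```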




Review: the Index -- DataFrame Column or Index Labels

An Index object is used to specify a DataFrame's columns or index, or a Series' index.


Columns and Indices


A DataFrame makes use of two Index objects: one to represent the columns, and one to represent the rows.

df = pd.DataFrame( {'a': [1, 2, 3, 4],
                    'b': [1.0, 1.5, 2.0, 2.5],
                    'c': ['a', 'b', 'c', 'd'],
                    'd': [100, 200, 300, 400] },
                    index=['r1', 'r2', 'r3', 'r4'] )
print(df)
    #     a    b  c    d
    # r1  1  1.0  a  100
    # r2  2  1.5  b  200
    # r3  3  2.0  c  300
    # r4  4  2.5  d  400

.rename() method: columns or index labels can be reset using this DataFrame method.

df = df.rename(columns={'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D'},
               index={'r1': 'R1', 'r2': 'R2', 'r3': 'R3', 'r4': 'R4'})
print(df)
    #     A    B  C    D
    # R1  1  1.0  a  100
    # R2  2  1.5  b  200
    # R3  3  2.0  c  300
    # R4  4  2.5  d  400

.columns, .index: the columns or index can also be set directly using the DataFrame's attributes (this would have the same effect as above):

df.columns = ['A', 'B', 'C', 'D']
df.index = ['R1', 'R2', 'R3', 'R4']

.set_index(): set any column to the index

df2 = df.set_index('A')
print(df2)
    #      B  C    D
    # A
    # 1  1.0  a  100
    # 2  1.5  b  200
    # 3  2.0  c  300
    # 4  2.5  d  400
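Passing a list of columns to .set_index() builds a MultiIndex (the hierarchical index covered below) -- a small sketch with a similar frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [1.0, 1.5], 'C': ['a', 'b']})
df2 = df.set_index(['A', 'C'])    # two columns -> a 2-level MultiIndex
print(df2.index.nlevels)          # 2
```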

.reset_index(): we can reset the index to integers starting from 0; by default this converts the previous index into a new column named 'index':

df3 = df.reset_index()
print(df3)
    #   index  A    B  C    D
    # 0    R1  1  1.0  a  100
    # 1    R2  2  1.5  b  200
    # 2    R3  3  2.0  c  300
    # 3    R4  4  2.5  d  400

or to drop the index when resetting, include drop=True

df4 = df.reset_index(drop=True)
print(df4)
    #    A    B  C    D
    # 0  1  1.0  a  100
    # 1  2  1.5  b  200
    # 2  3  2.0  c  300
    # 3  4  2.5  d  400

.reindex(): we can change the order of the index values, and thus the rows (the same works for columns):

df5 = df.reindex(df.index[::-1])               # reverse the row order

df5 = df5.reindex(columns=df.columns[::-1])    # reverse the column order

print(df5)

                      #       D  C    B  A
                      # R4  400  d  2.5  4
                      # R3  300  c  2.0  3
                      # R2  200  b  1.5  2
                      # R1  100  a  1.0  1

we can set names for index and column indices:

df.index.name = 'year'
df.columns.name = 'state'

There are a number of "exotic" Index object types:

Index (standard, default and most common Index type)
RangeIndex (index built from an integer range)
Int64Index, UInt64Index, Float64Index (index values of specific types)
DatetimeIndex, TimedeltaIndex, PeriodIndex, IntervalIndex (datetime-related indices)
CategoricalIndex (index related to the Categorical type)





The MultiIndex: a sequence of tuples

In a MultiIndex, we can think of a column or row label as having two items.


A MultiIndex specifies an "index within an index" or "column within a column" for more sophisticated labeling of data.


A DataFrame with MultiIndex columns (levels this and that) and a MultiIndex index (levels other and another):

this                  a                   b
that                  1         2         1         2
other another
x     1       -1.618192  1.040778  0.191557 -0.698187
      2        0.924018  0.517304  0.518304 -0.441154
y     1       -0.002322 -0.157659 -0.169507 -1.088491
      2        0.216550  1.428587  1.155101 -1.610666

The MultiIndex can be generated by a multi-column aggregation, or it can be set directly, as below.


The from_tuples() method creates a MultiIndex from tuple pairs that represent levels of the MultiIndex:

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

tuples = list(zip(*arrays))    # zip two lists like a zipper

#  [('bar', 'one'),
#   ('bar', 'two'),
#   ('baz', 'one'),
#   ('baz', 'two'),
#   ('foo', 'one'),
#   ('foo', 'two'),
#   ('qux', 'one'),
#   ('qux', 'two')]


index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

#   MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
#              codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
#              names=['first', 'second'])

The notation above is somewhat hard to read; for each level, the codes= parameter gives the position in the corresponding levels= list of the value appearing at each spot in the index.
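The same index can also be built with from_arrays(), which takes the two parallel lists directly and skips the zip step -- a short sketch:

```python
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

# from_arrays() pairs up the lists positionally
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
print(index[0])    # ('bar', 'one')
```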


Here we're applying the above index to a Series object:

import numpy as np

s = pd.Series(np.random.randn(8), index=index)

#   first  second
#   bar    one       0.469112
#          two      -0.282863
#   baz    one      -1.509059
#          two      -1.135632
#   foo    one       1.212112
#          two      -0.173215
#   qux    one       0.119209
#          two      -1.044236
#   dtype: float64
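A Series with a MultiIndex supports partial indexing: selecting on just the first level returns a sub-Series, while a full tuple reaches a single value. A small sketch with fixed values (rather than random ones, so the results are predictable):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('bar', 'one'), ('bar', 'two'),
                                   ('baz', 'one'), ('baz', 'two')],
                                  names=['first', 'second'])
s = pd.Series([0.1, 0.2, 0.3, 0.4], index=index)

print(s.loc['bar'])             # sub-Series with index 'one', 'two'
print(s.loc[('baz', 'two')])    # 0.4
```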




Slicing a MultiIndex

Slicing works more or less as expected; tuples help us specify Multilevel indices.


mindex = pd.MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
                       codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
                       names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'], index=mindex)

#                        A         B         C         D
#   first second
#   bar   one    -0.231171  0.340523  0.472207 -0.543819
#         two     0.113923  0.367657  0.171424 -0.039921
#   baz   one    -0.625282 -0.791371 -0.487958  0.568405
#         two    -1.128698 -1.040629  2.536821 -0.844057
#   foo   one    -1.319797 -1.277551 -0.614919  1.305367
#         two     0.414166 -0.427726  0.929567 -0.524161
#   qux   one     1.859414 -0.190417 -1.824712  0.454862
#         two    -0.169519 -0.850846 -0.444302 -0.577360

standard slicing

df['A']
#   first  second
#   bar    one      -0.231171
#          two       0.113923
#   baz    one      -0.625282
#          two      -1.128698
#   foo    one      -1.319797
#          two       0.414166
#   qux    one       1.859414
#          two      -0.169519
#   Name: A, dtype: float64


df.loc['bar']
#                  A         B         C         D
#   second
#   one    -0.231171  0.340523  0.472207 -0.543819
#   two     0.113923  0.367657  0.171424 -0.039921


df.loc[('bar', 'one')]             # also:  df.loc['bar'].loc['one']
#   A   -0.231171
#   B    0.340523
#   C    0.472207
#   D   -0.543819
#   Name: (bar, one), dtype: float64


df.loc[('bar', 'two'), 'A']
#   0.11392342023306047
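To take "every first-level row at a given second-level label" with .loc, pd.IndexSlice spells out the slice-this-level pattern. A sketch with a small fixed-value frame (the index must be sorted for this kind of slicing, as a from_product index is):

```python
import pandas as pd
import numpy as np

mindex = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                    names=['first', 'second'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), columns=['A', 'B'], index=mindex)

# ':' on the first level, 'two' on the second
rows_two = df.loc[pd.IndexSlice[:, 'two'], 'A']
print(list(rows_two))    # [2, 6]
```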




'cross-section' slicing with .xs

The level= parameter allows selecting along a lower level of the index.


mindex = pd.MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
                       codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
                       names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'], index=mindex)

#                        A         B         C         D
#   first second
#   bar   one    -0.231171  0.340523  0.472207 -0.543819
#         two     0.113923  0.367657  0.171424 -0.039921
#   baz   one    -0.625282 -0.791371 -0.487958  0.568405
#         two    -1.128698 -1.040629  2.536821 -0.844057
#   foo   one    -1.319797 -1.277551 -0.614919  1.305367
#         two     0.414166 -0.427726  0.929567 -0.524161
#   qux   one     1.859414 -0.190417 -1.824712  0.454862
#         two    -0.169519 -0.850846 -0.444302 -0.577360


# standard slicing
df.xs('bar')
#                  A         B         C         D
#   second
#   one    -0.231171  0.340523  0.472207 -0.543819
#   two     0.113923  0.367657  0.171424 -0.039921

df.xs(('baz', 'two'))
#   A   -1.128698
#   B   -1.040629
#   C    2.536821
#   D   -0.844057
#   Name: (baz, two), dtype: float64


# using the level= parameter
df.xs('two', level='second')
#                 A         B         C         D
#   first
#   bar    0.113923  0.367657  0.171424 -0.039921
#   baz   -1.128698 -1.040629  2.536821 -0.844057
#   foo    0.414166 -0.427726  0.929567 -0.524161
#   qux   -0.169519 -0.850846 -0.444302 -0.577360
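By default .xs drops the level it selected on; passing drop_level=False keeps the full MultiIndex in the result -- a sketch with a small fixed-value frame:

```python
import pandas as pd
import numpy as np

mindex = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                    names=['first', 'second'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), columns=['A', 'B'], index=mindex)

# both index levels survive in the result
kept = df.xs('two', level='second', drop_level=False)
print(list(kept.index))    # [('bar', 'two'), ('baz', 'two')]
```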



