Sometimes our computation is more complex than simple math, or we need to apply a function to each element. For this we can use apply(), which calls a function with each value in a row or column:
import pandas as pd
df = pd.DataFrame( {'a': [1, 2, 3, 4],
                    'b': [1.0, 1.5, 2.0, 2.5],
                    'c': ['a', 'b', 'c', 'd'] }, index=['r1', 'r2', 'r3', 'r4'] )
print(df)
# a b c
# r1 1 1.0 a
# r2 2 1.5 b
# r3 3 2.0 c
# r4 4 2.5 d
df['d'] = df['c'].apply(str.upper)
print(df)
# a b c d
# r1 1 1.0 a A
# r2 2 1.5 b B
# r3 3 2.0 c C
# r4 4 2.5 d D
We can also use a custom named function or a lambda with apply():
df['e'] = df['a'].apply(lambda x: '$' + str(x * 1000) )
print(df)
# a b c d e
# r1 1 1.0 a A $1000
# r2 2 1.5 b B $2000
# r3 3 2.0 c C $3000
# r4 4 2.5 d D $4000
See below for an explanation of lambda.
Compare these two functions, both of which add/concatenate their arguments:
def addthese(x, y):
    return x + y
addthese2 = lambda x, y: x + y
print(addthese(5, 9)) # 14
print(addthese2(5, 9)) # 14
The function definition and the lambda expression are equivalent: both produce a function with the same functionality.
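Series.apply() can also forward extra positional arguments to a named function via its args= parameter. A small sketch (the scale function name is our own invention):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# a hypothetical helper that multiplies a value by a factor
def scale(x, factor):
    return x * factor

# extra positional arguments are passed to the function via args=
scaled = s.apply(scale, args=(1000,))
print(scaled)
# 0    1000
# 1    2000
# 2    3000
# 3    4000
# dtype: int64
```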
Calculating a sum or count based on values in 2 or more columns.
To aggregate by the combined values of two or more columns, simply pass a list of the columns by which to aggregate -- the result is called a "multi-column aggregation" (this example uses the company/region/revenue DataFrame defined in full a bit further below):
print(df.groupby(['region', 'revtype']).sum())
# revenue
# region revtype
# NE retail 12
# wholesale 16
# NW retail 18
# wholesale 15
# SW retail 3
# wholesale 10
Note that the index has 2 columns (you can tell because the tops of the columns are 'recessed' beneath the column header row). This is a MultiIndex, or hierarchical index. In the above example NE stands over both retail and wholesale in the first 2 rows -- we should read these as NE-retail and NE-wholesale.
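The heading above mentions counts as well as sums; one way to request several aggregations at once is .agg(). A minimal sketch using the same region/revenue data (the result variable name is our own):

```python
import pandas as pd

df = pd.DataFrame({'region':  ['NE', 'NW', 'SW', 'NW', 'SW',
                               'NE', 'NE', 'SW', 'NW'],
                   'revenue': [10, 9, 2, 15, 8, 2, 16, 3, 9],
                   'revtype': ['retail', 'retail', 'wholesale',
                               'wholesale', 'wholesale', 'retail',
                               'wholesale', 'retail', 'retail']})

# .agg() accepts a list of aggregation names to compute side by side
result = df.groupby(['region', 'revtype'])['revenue'].agg(['sum', 'count'])
print(result)
#                   sum  count
# region revtype
# NE     retail      12      2
#        wholesale   16      1
# NW     retail      18      2
#        wholesale   15      1
# SW     retail       3      1
#        wholesale   10      2
```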
Here is the sample DataFrame used in the grouping examples that follow:
df = pd.DataFrame( { 'company': ['Alpha', 'Alpha', 'Alpha',
                                 'Beta', 'Beta', 'Beta',
                                 'Gamma', 'Gamma', 'Gamma'],
                     'region':  ['NE', 'NW', 'SW', 'NW', 'SW',
                                 'NE', 'NE', 'SW', 'NW'],
                     'revenue': [10, 9, 2, 15, 8, 2, 16, 3, 9],
                     'revtype': ['retail', 'retail', 'wholesale',
                                 'wholesale', 'wholesale', 'retail',
                                 'wholesale', 'retail', 'retail'] } )
print(df)
# company region revenue revtype
# 0 Alpha NE 10 retail
# 1 Alpha NW 9 retail
# 2 Alpha SW 2 wholesale
# 3 Beta NW 15 wholesale
# 4 Beta SW 8 wholesale
# 5 Beta NE 2 retail
# 6 Gamma NE 16 wholesale
# 7 Gamma SW 3 retail
# 8 Gamma NW 9 retail
groupby() functions using apply()
We can design our own custom aggregation functions -- we simply use apply() and pass a function (you might remember similarly passing a function to the key= argument of sorted()). Here is the equivalent of the sum() function, written as a custom function:
def get_sum(df_slice):
    return sum(df_slice['revenue'])
print(df.groupby('region').apply(get_sum)) # custom function: same as groupby('region').sum()
# region
# NE 28
# NW 33
# SW 13
# dtype: int64
As was done with sorted(), pandas calls our groupby function multiple times, once with each group. The argument that Python passes to our custom function is a DataFrame slice containing just the rows from a single grouping -- in this case, a specific region (i.e., it will be called once with a slice of NE rows, once with NW rows, etc.). The function should return the desired value for that slice -- in this case, the sum of the revenue column (as mentioned, this simply illustrates a function that does the same work as the built-in .sum() function). (For a better view of what is happening, print df_slice inside the function -- you will see the values in each slice printed.) Here is a custom function that returns the median ("middle value") for each region:
def get_median(df):
    listvals = sorted(list(df['revenue']))
    lenvals = len(listvals)
    midval = listvals[ int(lenvals / 2) ]
    return midval
print(df.groupby('region').apply(get_median))
# region
# NE 10
# NW 9
# SW 3
# dtype: int64
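A custom function passed to .apply() is not limited to returning a single value; if it returns a Series, each entry becomes a column in the result. A sketch with the same data (the summarize name is our own invention):

```python
import pandas as pd

df = pd.DataFrame({'region':  ['NE', 'NW', 'SW', 'NW', 'SW',
                               'NE', 'NE', 'SW', 'NW'],
                   'revenue': [10, 9, 2, 15, 8, 2, 16, 3, 9]})

# returning a Series produces one result column per entry
def summarize(df_slice):
    return pd.Series({'total':   df_slice['revenue'].sum(),
                      'largest': df_slice['revenue'].max()})

summary = df.groupby('region').apply(summarize)
print(summary)
#         total  largest
# region
# NE         28       16
# NW         33       15
# SW         13        8
```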
Standard aggregations group rows based on a column value ('NW', 'SW', etc.) or a combination of column values. If more work is needed to identify a group, we can supply a custom function for this operation as well. Perhaps we'd like to group our rows by whether or not they achieved a certain revenue target -- here, whether the revenue value is 10 or greater. Our function will simply return the number of digits in the value (a revenue of 10 or more has 2 digits; below 10, just 1). We can process this column value (or even include other column values) by referencing a function in the call to groupby():
def bydecplace(idx):
    row = df.loc[idx]                 # a Series with the row values for this index
    return len(str(row['revenue']))   # 2 if revenue >= 10; 1 if < 10
print(df.groupby(bydecplace).sum())
# revenue
# 1 33
# 2 41
The value passed to the function is the index of a row. We can thus use the .loc attribute with the index value to access the row. This function isolates the revenue within the row and returns its string length.
using lambdas as groupby() or grouping functions
Of course any of these simple functions can be rewritten as a lambda (and in many cases should be -- as in the above case, since the function references the dataframe directly, and we should prefer not to refer to outside variables in a standard function):
def bydecplace(idx):
    row = df.loc[idx]                 # a Series with the row values for this index
    return len(str(row['revenue']))   # 2 if revenue >= 10; 1 if < 10
print(df.groupby(lambda idx: len(str(df.loc[idx]['revenue']))).sum())
#    revenue
# 1       33
# 2       41
Columns and Indices
An Index object is used to specify a DataFrame's columns or index, or a Series' index.
A DataFrame makes use of two Index objects: one to represent the columns, and one to represent the rows.
df = pd.DataFrame( {'a': [1, 2, 3, 4],
                    'b': [1.0, 1.5, 2.0, 2.5],
                    'c': ['a', 'b', 'c', 'd'],
                    'd': [100, 200, 300, 400] },
                   index=['r1', 'r2', 'r3', 'r4'] )
print(df)
# a b c d
# r1 1 1.0 a 100
# r2 2 1.5 b 200
# r3 3 2.0 c 300
# r4 4 2.5 d 400
.rename() method: columns or index labels can be reset using this DataFrame method.
df = df.rename(columns={'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D'},
               index={'r1': 'R1', 'r2': 'R2', 'r3': 'R3', 'r4': 'R4'})
print(df)
# A B C D
# R1 1 1.0 a 100
# R2 2 1.5 b 200
# R3 3 2.0 c 300
# R4 4 2.5 d 400
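.rename() also accepts a function as the mapper, which is called on each label along that axis; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]}, index=['r1', 'r2'])

# a function mapper is applied to every label on that axis
df2 = df.rename(columns=str.upper, index=str.upper)
print(df2)
#     A    B
# R1  1  3.0
# R2  2  4.0
```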
.columns, .index: the columns or index can also be set directly using the DataFrame's attributes (this would have the same effect as above):
df.columns = ['A', 'B', 'C', 'D']
df.index = ['R1', 'R2', 'R3', 'R4']
.set_index(): set any column to the index
df2 = df.set_index('A')
print(df2)
# B C D
# A
# 1 1.0 a 100
# 2 1.5 b 200
# 3 2.0 c 300
# 4 2.5 d 400
.reset_index(): we can reset the index to integers starting from 0; by default this converts the previous index into a new column:
df3 = df.reset_index()
print(df3)
# index A B C D
# 0 R1 1 1.0 a 100
# 1 R2 2 1.5 b 200
# 2 R3 3 2.0 c 300
# 3 R4 4 2.5 d 400
or, to drop the previous index entirely when resetting, include drop=True:
df4 = df.reset_index(drop=True)
print(df4)
# A B C D
# 0 1 1.0 a 100
# 1 2 1.5 b 200
# 2 3 2.0 c 300
# 3 4 2.5 d 400
.reindex(): we can change the order of the indices and thus the rows (and likewise the columns):
df5 = df.reindex(index=df.index[::-1], columns=df.columns[::-1])
print(df5)
#       D  C    B  A
# R4  400  d  2.5  4
# R3  300  c  2.0  3
# R2  200  b  1.5  2
# R1  100  a  1.0  1
we can set names for index and column indices:
df.index.name = 'year'
df.columns.name = 'state'
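With the names set, the printed DataFrame shows them above the row labels and beside the column labels; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['r1', 'r2'])
df.index.name = 'year'       # label for the row index
df.columns.name = 'state'    # label for the column index
print(df)
# state  A  B
# year
# r1     1  3
# r2     2  4
```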
There are a number of "exotic" Index object types:
Index (standard, default and most common Index type)
RangeIndex (index built from an integer range)
Int64Index, UInt64Index, Float64Index (index values of specific types)
DatetimeIndex, TimedeltaIndex, PeriodIndex, IntervalIndex (datetime-related indices)
CategoricalIndex (index related to the Categorical type)
In a MultiIndex, we can think of a column or row label as having two items.
A MultiIndex specifies an "index within an index" or "column within a column" for more sophisticated labeling of data.
A DataFrame with multi-index columns this and that and multi-index index other and another
this                 a                   b
that                 1         2         1         2
other another
x     1      -1.618192  1.040778  0.191557 -0.698187
      2       0.924018  0.517304  0.518304 -0.441154
y     1      -0.002322 -0.157659 -0.169507 -1.088491
      2       0.216550  1.428587  1.155101 -1.610666
The MultiIndex can be generated by a multi-column aggregation, or it can be set directly, as below.
The from_tuples() method creates a MultiIndex from tuple pairs that represent levels of the MultiIndex:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays)) # zip two lists like a zipper
# [('bar', 'one'),
# ('bar', 'two'),
# ('baz', 'one'),
# ('baz', 'two'),
# ('foo', 'one'),
# ('foo', 'two'),
# ('qux', 'one'),
# ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
# MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
# codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
# names=['first', 'second'])
The notation above is somewhat hard to read; the codes= parameter specifies, position by position, which value from each of the levels= lists appears in each tuple pair.
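A related constructor, pd.MultiIndex.from_product(), builds the same index as the full cross-product of the level values, saving the tuple-building step:

```python
import pandas as pd

# cross-product of the two level lists: 4 x 2 = 8 pairs
index = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'],
                                    ['one', 'two']],
                                   names=['first', 'second'])
print(list(index))
# [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'),
#  ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]
```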
Here we're applying the above index to a Series object:
import numpy as np
s = pd.Series(np.random.randn(8), index=index)
# first second
# bar one 0.469112
# two -0.282863
# baz one -1.509059
# two -1.135632
# foo one 1.212112
# two -0.173215
# qux one 0.119209
# two -1.044236
# dtype: float64
Slicing works more or less as expected; tuples let us specify values at multiple levels of a MultiIndex.
mindex = pd.MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'], index=mindex)
# A B C D
# first second
# bar one -0.231171 0.340523 0.472207 -0.543819
# two 0.113923 0.367657 0.171424 -0.039921
# baz one -0.625282 -0.791371 -0.487958 0.568405
# two -1.128698 -1.040629 2.536821 -0.844057
# foo one -1.319797 -1.277551 -0.614919 1.305367
# two 0.414166 -0.427726 0.929567 -0.524161
# qux one 1.859414 -0.190417 -1.824712 0.454862
# two -0.169519 -0.850846 -0.444302 -0.577360
standard slicing
df['A']
# first second
# bar one -0.231171
# two 0.113923
# baz one -0.625282
# two -1.128698
# foo one -1.319797
# two 0.414166
# qux one 1.859414
# two -0.169519
# Name: A, dtype: float64
df.loc['bar']
# A B C D
# second
# one -0.231171 0.340523 0.472207 -0.543819
# two 0.113923 0.367657 0.171424 -0.039921
df.loc[('bar', 'one')] # also: df.loc['bar'].loc['one']
# A -0.231171
# B 0.340523
# C 0.472207
# D -0.543819
# Name: (bar, one), dtype: float64
df.loc[('bar', 'two'), 'A']
# 0.11392342023306047
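For selections like "every first-level value, but only certain second-level values", pandas provides pd.IndexSlice, a helper for building multi-level slices. A sketch with deterministic data (np.arange) so the values are predictable:

```python
import numpy as np
import pandas as pd

mindex = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'],
                                     ['one', 'two']],
                                    names=['first', 'second'])
df = pd.DataFrame(np.arange(32).reshape(8, 4),
                  columns=['A', 'B', 'C', 'D'], index=mindex)

# idx[:, 'two'] means: any first-level label, 'two' at the second level
idx = pd.IndexSlice
res = df.loc[idx[:, 'two'], 'A']
print(res)
# first  second
# bar    two        4
# baz    two       12
# foo    two       20
# qux    two       28
```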
The level= parameter of the .xs() ("cross-section") method allows slicing along a lower level of the index:
# standard slicing
df.xs('bar')
# A B C D
# second
# one -0.231171 0.340523 0.472207 -0.543819
# two 0.113923 0.367657 0.171424 -0.039921
df.xs(('baz', 'two'))
# A -1.128698
# B -1.040629
# C 2.536821
# D -0.844057
# Name: (baz, two), dtype: float64
# using the level= parameter
df.xs('two', level='second')
# A B C D
# first
# bar 0.113923 0.367657 0.171424 -0.039921
# baz -1.128698 -1.040629 2.536821 -0.844057
# foo 0.414166 -0.427726 0.929567 -0.524161
# qux -0.169519 -0.850846 -0.444302 -0.577360
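By default .xs() drops the level it selects on; passing drop_level=False keeps the full MultiIndex in the result. A sketch with deterministic values:

```python
import numpy as np
import pandas as pd

mindex = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                    names=['first', 'second'])
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  columns=['A', 'B', 'C', 'D'], index=mindex)

# keep the sliced 'second' level in the result's index
res = df.xs('two', level='second', drop_level=False)
print(res)
#                A   B   C   D
# first second
# bar   two      4   5   6   7
# baz   two     12  13  14  15
```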