Use a subscript (or attribute) to access columns by label; use the .loc[] or .iloc[] indexers to access rows by label or integer position.
a DataFrame:
import pandas as pd

df = pd.DataFrame( {'a': [1, 2, 3, 4],
                    'b': [1.0, 1.5, 2.0, 2.5],
                    'c': ['a', 'b', 'c', 'd'],
                    'd': [100, 200, 300, 400] },
                   index=['r1', 'r2', 'r3', 'r4'] )
access column as Series:
cola = df['a'] # Series with [1, 2, 3, 4] and index ['r1', 'r2', 'r3', 'r4']
cola = df.a # same -- can often use attribute labels for column name
print(cola)
# r1 1
# r2 2
# r3 3
# r4 4
# Name: a, dtype: int64
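Attribute access has limits: it works only when the column label is a valid Python identifier and doesn't shadow an existing DataFrame attribute or method. Subscripting always works. A quick comparison with the df above (the 'my col' and 'count' labels are hypothetical):
cold = df.d          # fine: 'd' is a valid identifier and not a DataFrame attribute
cold = df['d']       # equivalent; also the only option for labels like 'my col',
                     # or 'count' (df.count is the DataFrame.count method)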
access row as Series using index label 'r2':
row2 = df.loc['r2'] # Series [2, 1.5, 'b', 200] and index ['a', 'b', 'c', 'd']
access row as Series using integer index:
row2 = df.iloc[1] # Series [2, 1.5, 'b', 200] and index ['a', 'b', 'c', 'd'] (same as above)
print(row2)
# a 2
# b 1.5
# c b
# d 200
# Name: r2, dtype: object
(Note that the legacy .ix indexer was deprecated and then removed entirely in pandas 1.0; use .loc[] or .iloc[] instead.)
DataFrames can be sliced along a column or row (Series) or both (DataFrame).
Access a Series object through DataFrame column or index labels
Again, we can apply any Series operation to any of the Series within a DataFrame - slice, access by index, etc.
dfi = pd.DataFrame( {'a': [1, 2, 3, 4],
                     'b': [1.0, 1.5, 2.0, 2.5],
                     'c': ['a', 'b', 'c', 'd'],
                     'd': [100, 200, 300, 400] },
                    index=['r1', 'r2', 'r3', 'r4'] )
print(dfi['b'])
# r1 1.0
# r2 1.5
# r3 2.0
# r4 2.5
# Name: b, dtype: float64
print(dfi['b'][0:3])
# r1    1.0
# r2    1.5
# r3    2.0
# Name: b, dtype: float64

print(dfi['b']['r2'])
# 1.5
Create a DataFrame from columns of another DataFrame
Oftentimes we want to eliminate one or more columns from our DataFrame. We do this by selecting a list of columns out of the DataFrame, producing a new DataFrame:
>>> dfi[['a', 'c']]
    a  c
r1  1  a
r2  2  b
r3  3  c
r4  4  d
Far less often we may want to isolate a row from a DataFrame - this is also returned to us as a Series. Note that the column labels have become the Series index, and the row label becomes the Series name.
2-dimensional slicing
A double subscript can select a 2-dimensional slice (some rows and some columns).
dfi[['a', 'b']]['r1': 'r3']
Also note carefully the list inside the first square brackets.
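Printed, the 2-dimensional slice above looks like this (using the dfi DataFrame defined earlier; the label slice upper bound 'r3' is included):
print(dfi[['a', 'b']]['r1': 'r3'])
# r1 1 1.0
# r2 2 1.5
# r3 3 2.0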
Oftentimes we want to select rows based on row criteria (i.e., conditionally). To do this, we establish a boolean test, placed within subscript-like square brackets.
Selecting rows based on column criteria:
import pandas as pd
df = pd.DataFrame( {'a': [1, 2, 3, 4],
                    'b': [-1.0, -1.5, 2.0, 2.5],
                    'c': ['a', 'b', 'c', 'd'] },
                   index=['r1', 'r2', 'r3', 'r4'] )
print(df)
# a b c
# r1 1 -1.0 a
# r2 2 -1.5 b
# r3 3 2.0 c
# r4 4 2.5 d
print(df[ df['b'] < 0 ]) # select rows where 'b' value is < 0
# a b c
# r1 1 -1.0 a
# r2 2 -1.5 b
The boolean test by itself returns a boolean Series. Its values indicate whether the test returned True for the corresponding value in the tested Series. This test can of course be assigned to a name and used by name, which is common for complex criteria:
b_series = df['a'] > 2
print(b_series) # we are printing this just for illustration
# r1 False
# r2 False
# r3 True
# r4 True
# Name: a, dtype: bool
print(df[ b_series ])
# a b c
# r3 3 2.0 c
# r4 4 2.5 d
negating a boolean test
a tilde (~) in front of a boolean test creates its inverse:
b_test = df['a'] > 2
print(df[ ~b_test ])
# a b c
# r1 1 -1.0 a
# r2 2 -1.5 b
compound tests use & for 'and', | for 'or', and ( ) to separate tests
The parentheses are needed to disambiguate the parts of the compound test.
print(df[ (df.a > 3) & (df.a < 5) ])
# a b c
# r4 4 2.5 d
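The same pattern works with | for 'or' - for example, rows where 'a' is below 2 or above 3 (same df as above):
print(df[ (df.a < 2) | (df.a > 3) ])
# a b c
# r1 1 -1.0 a
# r4 4 2.5 d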
An enigmatic warning that has bedeviled pandas coders.
We often begin work by reading a large dataset into a DataFrame, then slicing out a meaningful subset (eliminating columns and rows that are irrelevant to our analysis). Then we may wish to make some changes to the slice, or add columns to the slice. A recurrent problem in working with slices is that standard slicing may produce a view into the original data, or it may produce a temporary copy. If a change is made to a temporary copy, our working data will not be changed.
Here we are creating a slice by selecting rows with a boolean test:
dfi = pd.DataFrame({'c1': [0, 1, 2, 3, 4],
                    'c2': [5, 6, 7, 8, 9],
                    'c3': [10, 11, 12, 13, 14],
                    'c4': [15, 16, 17, 18, 19],
                    'c5': [20, 21, 22, 23, 24],
                    'c6': [25, 26, 27, 28, 29] },
                   index=['r1', 'r2', 'r3', 'r4', 'r5'])
dfi_prime = dfi[ dfi.c1 > 2 ]
print(dfi_prime)
# c1 c2 c3 c4 c5 c6
# r4 3 8 13 18 23 28
# r5 4 9 14 19 24 29
dfi_prime.c3 = dfi_prime.c1 * dfi_prime.c2
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
# See the caveats in the documentation:
# http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Note that in some cases this warning will not appear, and in others the warning will appear and yet the change will have taken effect.
The same problem may occur with a simple slice selection:
myslice = dfi[ ['c1', 'c2', 'c3'] ]
print(myslice)
#     c1  c2  c3
# r1   0   5  10
# r2   1   6  11
# r3   2   7  12
# r4   3   8  13
# r5   4   9  14
myslice.c3 = myslice.c1 * myslice.c2
print(myslice)
The problem here is that pandas cannot guarantee whether the slice is a view on the original data, or a temporary copy! If a temporary copy, a change will not take effect! What's particularly problematic about this warning is that we may not always see it in these situations. We may also see false positives and false negatives, as is acknowledged in the documentation.
Set the "Copy on Write" configuration, or use .loc[]
There are two solutions to the 'copy of a slice' warning (and its attendant uncertainty).
The recommended solution is to enable the "Copy on Write" functionality, which causes all write operations on slices to work on copies, so the underlying data does not change:
pd.options.mode.copy_on_write = True
myslice.c3 = myslice.c1 * myslice.c2
print(myslice)
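Under Copy on Write the assignment modifies only myslice; a quick check (a sketch assuming the dfi defined above, re-deriving the slice after enabling the option) shows the original is untouched:
myslice = dfi[ ['c1', 'c2', 'c3'] ]
myslice.c3 = myslice.c1 * myslice.c2
print(myslice.c3.tolist())   # [0, 6, 14, 24, 36]
print(dfi.c3.tolist())       # [10, 11, 12, 13, 14] - dfi is unchanged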
Another solution is to use .loc or .iloc:
filtered = dfi.loc[ dfi.c3 > 11, : ]    # filter rows on a column test, include all columns
filtered.c3 = filtered.c1 * filtered.c2
print(filtered)
Keep in mind that you may get a warning even with this approach; you can consider it a false positive (i.e., disregard it). More details about .loc are in the next section. It's important to note that Copy on Write will become the default behavior starting with pandas 3.0.
If a slice is to be changed, it should be derived using .loc[] rather than plain subscript slicing.
Again, starting with this DataFrame:
dfi = pd.DataFrame({'c1': [0, 1, 2, 3, 4],
                    'c2': [5, 6, 7, 8, 9],
                    'c3': [10, 11, 12, 13, 14],
                    'c4': [15, 16, 17, 18, 19],
                    'c5': [20, 21, 22, 23, 24],
                    'c6': [25, 26, 27, 28, 29] },
                   index=['r1', 'r2', 'r3', 'r4', 'r5'])
# c1 c2 c3 c4 c5 c6
# r1 0 5 10 15 20 25
# r2 1 6 11 16 21 26
# r3 2 7 12 17 22 27
# r4 3 8 13 18 23 28
# r5 4 9 14 19 24 29
Slicing Columns: these examples select all rows and one or more columns.
Slice a range of columns with a slice of column labels:
dfi_slice = dfi.loc[:, 'c1': 'c3']
# c1 c2 c3
# r1 0 5 10
# r2 1 6 11
# r3 2 7 12
# r4 3 8 13
# r5 4 9 14
Note the slice upper bound is inclusive!
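This is the opposite of .iloc[], where (as in ordinary Python slicing) the upper bound is exclusive. A quick contrast with the same dfi:
dfi_slice = dfi.iloc[:, 0:2]    # integer positions 0 and 1; position 2 is excluded
# c1 c2
# r1 0 5
# r2 1 6
# r3 2 7
# r4 3 8
# r5 4 9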
Slice a single column Series with a string column label:
dfi_slice = dfi.loc[:, 'c3']
# r1 10
# r2 11
# r3 12
# r4 13
# r5 14
# Name: c3, dtype: int64
Slice a selection of columns with a list of column labels:
dfi_slice = dfi.loc[:, ['c2', 'c3']]
# c2 c3
# r1 5 10
# r2 6 11
# r3 7 12
# r4 8 13
# r5 9 14
However, as of pandas 1.0, passing a list to .loc[] raises a KeyError if any of the labels is missing from the axis. To select labels that may or may not exist, use .reindex():
dfi_slice = dfi.reindex(['c2', 'c3'], axis=1)
axis=1 indicates we are working on columns, not rows; leave this argument off (or pass axis=0) when working with row labels, as sketched below.
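A minimal sketch for row labels, assuming the same dfi ('r9' is a deliberately missing label, which .reindex() fills with NaN rather than raising a KeyError; the columns become float64 because NaN is a float):
dfi_rows = dfi.reindex(['r1', 'r5', 'r9'])
# c1 c2 c3 c4 c5 c6
# r1 0.0 5.0 10.0 15.0 20.0 25.0
# r5 4.0 9.0 14.0 19.0 24.0 29.0
# r9 NaN NaN NaN NaN NaN NaN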
Slicing Rows: these examples select one or more rows and all columns.
Slice a range of rows with a slice of row labels:
dfi_slice = dfi.loc['r1': 'r3', :]
# c1 c2 c3 c4 c5 c6
# r1 0 5 10 15 20 25
# r2 1 6 11 16 21 26
# r3 2 7 12 17 22 27
Note the slice upper bound is inclusive!
Slice a single row Series with a string row label:
dfi_slice = dfi.loc['r2', :]
# c1 1
# c2 6
# c3 11
# c4 16
# c5 21
# c6 26
# Name: r2, dtype: int64
Slice a selection of rows with a list of row labels:
dfi_slice = dfi.loc[['r1', 'r3', 'r5'], :]
# c1 c2 c3 c4 c5 c6
# r1 0 5 10 15 20 25
# r3 2 7 12 17 22 27
# r5 4 9 14 19 24 29
Slicing Rows and Columns
We can of course specify both rows and columns:
dfi.loc['r1': 'r3', 'c1': 'c3']
# c1 c2 c3
# r1 0 5 10
# r2 1 6 11
# r3 2 7 12
A conditional can be used with .loc[] to select rows or columns
Again, starting with this DataFrame:
dfi = pd.DataFrame({'c1': [0, 1, 2, 3, 4],
                    'c2': [5, 6, 7, 8, 9],
                    'c3': [10, 11, 12, 13, 14],
                    'c4': [15, 16, 17, 18, 19],
                    'c5': [20, 21, 22, 23, 24],
                    'c6': [25, 26, 27, 28, 29] },
                   index=['r1', 'r2', 'r3', 'r4', 'r5'])
# c1 c2 c3 c4 c5 c6
# r1 0 5 10 15 20 25
# r2 1 6 11 16 21 26
# r3 2 7 12 17 22 27
# r4 3 8 13 18 23 28
# r5 4 9 14 19 24 29
.loc[] can also specify rows or columns based on criteria -- here are all the rows with 'c3' value greater than 11 (and all columns):
dfislice = dfi.loc[ dfi['c3'] > 11, :]
# c1 c2 c3 c4 c5 c6
# r3 2 7 12 17 22 27
# r4 3 8 13 18 23 28
# r5 4 9 14 19 24 29
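The column axis can be filtered the same way - here, all columns whose 'r1' value is greater than 10 (and all rows), a sketch using the same dfi:
dfislice = dfi.loc[:, dfi.loc['r1'] > 10]
# c4 c5 c6
# r1 15 20 25
# r2 16 21 26
# r3 17 22 27
# r4 18 23 28
# r5 19 24 29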
In order to add or change column values based on a row boolean test, we can specify which column should change and assign a value to it:
dfi.loc[ dfi['c3'] > 11, 'c6'] = dfi['c6'] * 100 # 100 * 'c6' value if 'c3' > 11
print(dfi)
# c1 c2 c3 c4 c5 c6
# r1 0 5 10 15 20 25
# r2 1 6 11 16 21 26
# r3 2 7 12 17 22 2700
# r4 3 8 13 18 23 2800
# r5 4 9 14 19 24 2900
pd.concat() is analogous to the legacy df.append() (removed in pandas 2.0)
concat() can join dataframes either horizontally or vertically.
df = pd.DataFrame( {'a': [1, 2],
                    'b': [1.0, 1.5] } )
df2 = pd.DataFrame( {'b': [1, 2],
                     'c': [1.0, 1.5] } )
df3 = pd.concat([df, df2])
print(df3)
# a b c
# 0 1.0 1.0 NaN
# 1 2.0 1.5 NaN
# 0 NaN 1.0 1.0
# 1 NaN 2.0 1.5
Note that the column labels have been aligned. As a result, some data is seen to be "missing", with the NaN value used (discussed shortly).
In horizontal concatenation, the row labels are aligned but the column labels may be repeated:
df4 = pd.concat([df, df2], axis=1)
print(df4)
# a b b c
# 0 1 1.0 1 1.0
# 1 2 1.5 2 1.5
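concat() defaults to join='outer' (the union of labels on the aligned axis); join='inner' keeps only the labels common to both frames. A sketch with the same df and df2, where only column 'b' is shared:
df5 = pd.concat([df, df2], join='inner')
print(df5)
# b
# 0 1.0
# 1 1.5
# 0 1.0
# 1 2.0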
DataFrame append() was the method equivalent of pd.concat(), called on a DataFrame; it was deprecated in pandas 1.4 and removed in pandas 2.0, so current code should use pd.concat():
df = pd.concat([df, df2])            # formerly: df = df.append(df2)
df = pd.concat([df, df2], axis=1)    # append() had no axis= parameter; concat covers both directions
To append a blank row (formerly df.append(pd.Series(), ignore_index=True)), concatenate an empty one-row DataFrame with ignore_index=True:
df = pd.DataFrame( {'a': [1, 2],
                    'b': [1.0, 1.5] } )
df = pd.concat([df, pd.DataFrame(index=[0])], ignore_index=True)
print(df)
#      a    b
# 0  1.0  1.0
# 1  2.0  1.5
# 2  NaN  NaN
merge() provides database-like joins.
Merge performs a relational database-like join on two dataframes. We can join on a particular field and the other fields will align accordingly.
companies = pd.read_excel('company_states.xlsx', sheet_name='Companies')
states = pd.read_excel('company_states.xlsx', sheet_name='States')
print(companies)
# Company State
# 0 Microsoft WA
# 1 Apple CA
# 2 IBM NY
# 3 PRTech PR
print(states)
# State Abbrev State Long
# 0 AZ Arizona
# 1 CA California
# 2 CO Colorado
# 3 NY New York
# 4 WA Washington
cs = pd.merge(companies, states,
              left_on='State', right_on='State Abbrev')
print(cs)
# Company State State Abbrev State Long
# 0 Microsoft WA WA Washington
# 1 Apple CA CA California
# 2 IBM NY NY New York
By default merge performs an inner join, so PRTech (whose 'PR' has no match in states) is dropped from the result.
When we merge, we can join on one or more columns (by default, merge joins on the columns common to both frames), or on the index (with left_index= / right_index=). The join types are similar to those in relational databases:

Merge method   SQL Join Name      Description
left           LEFT OUTER JOIN    use keys from left frame only
right          RIGHT OUTER JOIN   use keys from right frame only
outer          FULL OUTER JOIN    use union of keys from both frames
inner          INNER JOIN         use intersection of keys from both frames

how= describes the type of join
on= designates the column on which to join
If the join columns are named differently, we use left_on= and right_on=
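To join on a frame's index instead of a column, pass right_index=True (or left_index=True). A sketch using the frames above, with states re-indexed by its abbreviation column:
states_idx = states.set_index('State Abbrev')
cs = pd.merge(companies, states_idx, left_on='State', right_index=True)
print(cs)
# Company State State Long
# 0 Microsoft WA Washington
# 1 Apple CA California
# 2 IBM NY New York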
left join: include all keys from the 'left' dataframe. Every company is kept (PRTech appears with NaN in the right-side columns because 'PR' has no match), while states absent from 'companies' (AZ, CO) are excluded.
cs = pd.merge(companies, states, how='left',
              left_on='State', right_on='State Abbrev')
print(cs)
# Company State State Abbrev State Long
# 0 Microsoft WA WA Washington
# 1 Apple CA CA California
# 2 IBM NY NY New York
# 3 PRTech PR NaN NaN
(A right join is the mirror image: all keys from the 'right' dataframe are kept, as sketched below.)
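A minimal sketch of the right join with the same frames (row order follows the right frame and may vary):
cs = pd.merge(companies, states, how='right',
              left_on='State', right_on='State Abbrev')
print(cs)
# Company State State Abbrev State Long
# 0 NaN NaN AZ Arizona
# 1 Apple CA CA California
# 2 NaN NaN CO Colorado
# 3 IBM NY NY New York
# 4 Microsoft WA WA Washington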
outer join: include keys from both dataframes. Note that all states are included, and the missing data from 'companies' is shown as NaN:
cs = pd.merge(companies, states, how='outer',
              left_on='State', right_on='State Abbrev')
print(cs)
# Company State State Abbrev State Long
# 0 Microsoft WA WA Washington
# 1 Apple CA CA California
# 2 IBM NY NY New York
# 3 PRTech PR NaN NaN
# 4 NaN NaN AZ Arizona
# 5 NaN NaN CO Colorado
inner join: include only keys common to both dataframes. Note that PRTech (unmatched 'PR') and the unmatched states (AZ, CO) are all dropped:
cs = pd.merge(companies, states, how='inner',
              left_on='State', right_on='State Abbrev')
print(cs)
# Company State State Abbrev State Long
# 0 Microsoft WA WA Washington
# 1 Apple CA CA California
# 2 IBM NY NY New York