Use a subscript (or attribute) to access columns by label; use the .loc[] or .iloc[] indexers to access rows by label or integer position.
a DataFrame:
import pandas as pd

df = pd.DataFrame( {'a': [1, 2, 3, 4],
                    'b': [1.0, 1.5, 2.0, 2.5],
                    'c': ['a', 'b', 'c', 'd'],
                    'd': [100, 200, 300, 400] },
                   index=['r1', 'r2', 'r3', 'r4'] )
access column as Series:
cola = df['a'] # Series with [1, 2, 3, 4] and index ['r1', 'r2', 'r3', 'r4']
cola = df.a # same -- can often use attribute labels for column name
print(cola)
# r1 1
# r2 2
# r3 3
# r4 4
# Name: a, dtype: int64
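Attribute access has limits: it works only when the column label is a valid Python identifier and doesn't shadow an existing DataFrame attribute or method. Subscripting always works. A quick comparison with the df above (the 'my col' and 'count' labels are hypothetical):
cold = df.d          # fine: 'd' is a valid identifier and not a DataFrame attribute
cold = df['d']       # equivalent; also the only option for labels like 'my col',
                     # or 'count' (df.count is the DataFrame.count method)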
access row as Series using index label 'r2':
row2 = df.loc['r2'] # Series [2, 1.5, 'b', 200] and index ['a', 'b', 'c', 'd']
access row as Series using integer index:
row2 = df.iloc[1] # Series [2, 1.5, 'b', 200] and index ['a', 'b', 'c', 'd'] (same as above)
print(row2)
# a 2
# b 1.5
# c b
# d 200
# Name: r2, dtype: object
(Note that the legacy .ix indexer was deprecated and then removed entirely in pandas 1.0; use .loc[] or .iloc[] instead.)
DataFrames can be sliced along a column or row (Series) or both (DataFrame).
Access a Series object through DataFrame column or index labels
Again, we can apply any Series operation to any of the Series within a DataFrame - slice, access by index, etc.
dfi = pd.DataFrame( {'a': [1, 2, 3, 4],
                     'b': [1.0, 1.5, 2.0, 2.5],
                     'c': ['a', 'b', 'c', 'd'],
                     'd': [100, 200, 300, 400] },
                    index=['r1', 'r2', 'r3', 'r4'] )
print(dfi['b'])
# r1 1.0
# r2 1.5
# r3 2.0
# r4 2.5
# Name: b, dtype: float64
print(dfi['b'][0:3])
# r1    1.0
# r2    1.5
# r3    2.0
# Name: b, dtype: float64

print(dfi['b']['r2'])
# 1.5
Create a DataFrame from columns of another DataFrame
Oftentimes we want to eliminate one or more columns from our DataFrame. We do this by selecting a list of columns out of the DataFrame, producing a new DataFrame:
>>> dfi[['a', 'c']]
    a  c
r1  1  a
r2  2  b
r3  3  c
r4  4  d
Far less often we may want to isolate a row from a DataFrame - this is also returned to us as a Series. Note that the column labels have become the Series index, and the row label becomes the Series name.
2-dimensional slicing
A double subscript can select a 2-dimensional slice (some rows and some columns).
dfi[['a', 'b']]['r1': 'r3']
Also note carefully the list inside the first square brackets.
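Printed, the 2-dimensional slice above looks like this (using the dfi DataFrame defined earlier; the label slice upper bound 'r3' is included):
print(dfi[['a', 'b']]['r1': 'r3'])
# r1 1 1.0
# r2 2 1.5
# r3 3 2.0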
Oftentimes we want to select rows based on row criteria (i.e., conditionally). To do this, we establish a boolean test, placed within subscript-like square brackets.
Selecting rows based on column criteria:
import pandas as pd
df = pd.DataFrame( {'a': [1, 2, 3, 4],
                    'b': [-1.0, -1.5, 2.0, 2.5],
                    'c': ['a', 'b', 'c', 'd'] },
                   index=['r1', 'r2', 'r3', 'r4'] )
print(df)
# a b c
# r1 1 -1.0 a
# r2 2 -1.5 b
# r3 3 2.0 c
# r4 4 2.5 d
print(df[ df['b'] < 0 ]) # select rows where 'b' value is < 0
# a b c
# r1 1 -1.0 a
# r2 2 -1.5 b
The boolean test by itself returns a boolean Series. Its values indicate whether the test returned True for the corresponding value in the tested Series. This test can of course be assigned to a name and used by name, which is common for complex criteria:
b_series = df['a'] > 2
print(b_series) # we are printing this just for illustration
# r1 False
# r2 False
# r3 True
# r4 True
# Name: a, dtype: bool
print(df[ b_series ])
# a b c
# r3 3 2.0 c
# r4 4 2.5 d
negating a boolean test
a tilde (~) in front of a boolean test creates its inverse:
b_test = df['a'] > 2
print(df[ ~b_test ])
# a b c
# r1 1 -1.0 a
# r2 2 -1.5 b
compound tests use & for 'and', | for 'or', and ( ) to separate tests
The parentheses are needed to disambiguate the parts of the compound test.
print(df[ (df.a > 3) & (df.a < 5) ])
# a b c
# r4 4 2.5 d
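The same pattern works with | for 'or' - for example, rows where 'a' is below 2 or above 3 (same df as above):
print(df[ (df.a < 2) | (df.a > 3) ])
# a b c
# r1 1 -1.0 a
# r4 4 2.5 d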
An enigmatic warning that has bedeviled pandas coders.
We often begin work by reading a large dataset into a DataFrame, then slicing out a meaningful subset (eliminating columns and rows that are irrelevant to our analysis). Then we may wish to make some changes to the slice, or add columns to the slice. A recurrent problem in working with slices is that standard slicing may produce a view into the original data, or it may produce a temporary copy. If a change is made to a temporary copy, our working data will not be changed.
Here we are creating a slice by selecting rows with a boolean test:
dfi = pd.DataFrame({'c1': [0, 1, 2, 3, 4],
                    'c2': [5, 6, 7, 8, 9],
                    'c3': [10, 11, 12, 13, 14],
                    'c4': [15, 16, 17, 18, 19],
                    'c5': [20, 21, 22, 23, 24],
                    'c6': [25, 26, 27, 28, 29] },
                   index=['r1', 'r2', 'r3', 'r4', 'r5'])
dfi_prime = dfi[ dfi.c1 > 2 ]
print(dfi_prime)
# c1 c2 c3 c4 c5 c6
# r4 3 8 13 18 23 28
# r5 4 9 14 19 24 29
dfi_prime.c3 = dfi_prime.c1 * dfi_prime.c2
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
# See the caveats in the documentation:
# http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Note that in some cases this warning will not appear, and in others the warning will appear and yet the change will have taken effect.
The same problem may occur with a simple slice selection:
myslice = dfi[ ['c1', 'c2', 'c3'] ]
print(myslice)
#     c1  c2  c3
# r1   0   5  10
# r2   1   6  11
# r3   2   7  12
# r4   3   8  13
# r5   4   9  14
myslice.c3 = myslice.c1 * myslice.c2
print(myslice)
The problem here is that pandas cannot guarantee whether the slice is a view on the original data, or a temporary copy! If a temporary copy, a change will not take effect! What's particularly problematic about this warning is that we may not always see it in these situations. We may also see false positives and false negatives, as is acknowledged in the documentation.
Set the "Copy on Write" configuration, or use .loc[]
There are two solutions to the 'copy of a slice' warning (and its attendant uncertainty).
The recommended solution is to enable the "Copy on Write" functionality, which causes all write operations on slices to work on copies, so the underlying data does not change:
pd.options.mode.copy_on_write = True
myslice.c3 = myslice.c1 * myslice.c2
print(myslice)
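Under Copy on Write the assignment modifies only myslice; a quick check (a sketch assuming the dfi defined above, re-deriving the slice after enabling the option) shows the original is untouched:
myslice = dfi[ ['c1', 'c2', 'c3'] ]
myslice.c3 = myslice.c1 * myslice.c2
print(myslice.c3.tolist())   # [0, 6, 14, 24, 36]
print(dfi.c3.tolist())       # [10, 11, 12, 13, 14] - dfi is unchanged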
Another solution is to use .loc or .iloc:
filtered = dfi.loc[ dfi.c3 > 11, : ]    # filter rows on a column test, include all columns
filtered.c3 = filtered.c1 * filtered.c2
print(filtered)
Keep in mind that you may get a warning even with this approach; you can consider it a false positive (i.e., disregard it). More details about .loc are in the next section. It's important to note that Copy on Write will become the default behavior starting with pandas 3.0.
If a slice is to be changed, it should be derived using .loc[] rather than plain subscript slicing.
Again, starting with this DataFrame:
dfi = pd.DataFrame({'c1': [0, 1, 2, 3, 4],
                    'c2': [5, 6, 7, 8, 9],
                    'c3': [10, 11, 12, 13, 14],
                    'c4': [15, 16, 17, 18, 19],
                    'c5': [20, 21, 22, 23, 24],
                    'c6': [25, 26, 27, 28, 29] },
                   index=['r1', 'r2', 'r3', 'r4', 'r5'])
# c1 c2 c3 c4 c5 c6
# r1 0 5 10 15 20 25
# r2 1 6 11 16 21 26
# r3 2 7 12 17 22 27
# r4 3 8 13 18 23 28
# r5 4 9 14 19 24 29
Slicing Columns: these examples select all rows and one or more columns.
Slice a range of columns with a slice of column labels:
dfi_slice = dfi.loc[:, 'c1': 'c3']
# c1 c2 c3
# r1 0 5 10
# r2 1 6 11
# r3 2 7 12
# r4 3 8 13
# r5 4 9 14
Note the slice upper bound is inclusive!
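This is the opposite of .iloc[], where (as in ordinary Python slicing) the upper bound is exclusive. A quick contrast with the same dfi:
dfi_slice = dfi.iloc[:, 0:2]    # integer positions 0 and 1; position 2 is excluded
# c1 c2
# r1 0 5
# r2 1 6
# r3 2 7
# r4 3 8
# r5 4 9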
Slice a single column Series with a string column label:
dfi_slice = dfi.loc[:, 'c3']
# r1 10
# r2 11
# r3 12
# r4 13
# r5 14
# Name: c3, dtype: int64
Slice a selection of columns with a list of column labels:
dfi_slice = dfi.loc[:, ['c2', 'c3']]
# c2 c3
# r1 5 10
# r2 6 11
# r3 7 12
# r4 8 13
# r5 9 14
However, as of pandas 1.0, passing a list to .loc[] raises a KeyError if any of the labels is missing from the axis. To select labels that may or may not exist, use .reindex():
dfi_slice = dfi.reindex(['c2', 'c3'], axis=1)
axis=1 indicates we are working on columns, not rows; leave this argument off (or pass axis=0) when working with row labels, as sketched below.
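A minimal sketch for row labels, assuming the same dfi ('r9' is a deliberately missing label, which .reindex() fills with NaN rather than raising a KeyError; the columns become float64 because NaN is a float):
dfi_rows = dfi.reindex(['r1', 'r5', 'r9'])
# c1 c2 c3 c4 c5 c6
# r1 0.0 5.0 10.0 15.0 20.0 25.0
# r5 4.0 9.0 14.0 19.0 24.0 29.0
# r9 NaN NaN NaN NaN NaN NaN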
Slicing Rows: these examples select one or more rows and all columns.
Slice a range of rows with a slice of row labels:
dfi_slice = dfi.loc['r1': 'r3', :]
# c1 c2 c3 c4 c5 c6
# r1 0 5 10 15 20 25
# r2 1 6 11 16 21 26
# r3 2 7 12 17 22 27
Note the slice upper bound is inclusive!
Slice a single row Series with a string row label:
dfi_slice = dfi.loc['r2', :]
# c1 1
# c2 6
# c3 11
# c4 16
# c5 21
# c6 26
# Name: r2, dtype: int64
Slice a selection of rows with a list of row labels:
dfi_slice = dfi.loc[['r1', 'r3', 'r5'], :]
# c1 c2 c3 c4 c5 c6
# r1 0 5 10 15 20 25
# r3 2 7 12 17 22 27
# r5 4 9 14 19 24 29
Slicing Rows and Columns
We can of course specify both rows and columns:
dfi.loc['r1': 'r3', 'c1': 'c3']
# c1 c2 c3
# r1 0 5 10
# r2 1 6 11
# r3 2 7 12
A conditional can be used with .loc[] to select rows or columns
Again, starting with this DataFrame:
dfi = pd.DataFrame({'c1': [0, 1, 2, 3, 4],
                    'c2': [5, 6, 7, 8, 9],
                    'c3': [10, 11, 12, 13, 14],
                    'c4': [15, 16, 17, 18, 19],
                    'c5': [20, 21, 22, 23, 24],
                    'c6': [25, 26, 27, 28, 29] },
                   index=['r1', 'r2', 'r3', 'r4', 'r5'])
# c1 c2 c3 c4 c5 c6
# r1 0 5 10 15 20 25
# r2 1 6 11 16 21 26
# r3 2 7 12 17 22 27
# r4 3 8 13 18 23 28
# r5 4 9 14 19 24 29
.loc[] can also specify rows or columns based on criteria -- here are all the rows with 'c3' value greater than 11 (and all columns):
dfislice = dfi.loc[ dfi['c3'] > 11, :]
# c1 c2 c3 c4 c5 c6
# r3 2 7 12 17 22 27
# r4 3 8 13 18 23 28
# r5 4 9 14 19 24 29
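The column axis can be filtered the same way - here, all columns whose 'r1' value is greater than 10 (and all rows), a sketch using the same dfi:
dfislice = dfi.loc[:, dfi.loc['r1'] > 10]
# c4 c5 c6
# r1 15 20 25
# r2 16 21 26
# r3 17 22 27
# r4 18 23 28
# r5 19 24 29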
In order to add or change column values based on a row boolean test, we can specify which column should change and assign a value to it:
dfi.loc[ dfi['c3'] > 11, 'c6'] = dfi['c6'] * 100 # 100 * 'c6' value if 'c3' > 11
print(dfi)
# c1 c2 c3 c4 c5 c6
# r1 0 5 10 15 20 25
# r2 1 6 11 16 21 26
# r3 2 7 12 17 22 2700
# r4 3 8 13 18 23 2800
# r5 4 9 14 19 24 2900
pd.concat() is analogous to the legacy df.append() (removed in pandas 2.0)
concat() can join dataframes either horizontally or vertically.
df = pd.DataFrame( {'a': [1, 2],
                    'b': [1.0, 1.5] } )
df2 = pd.DataFrame( {'b': [1, 2],
                     'c': [1.0, 1.5] } )
df3 = pd.concat([df, df2])
print(df3)
# a b c
# 0 1.0 1.0 NaN
# 1 2.0 1.5 NaN
# 0 NaN 1.0 1.0
# 1 NaN 2.0 1.5
Note that the column labels have been aligned. As a result, some data is seen to be "missing", with the NaN value used (discussed shortly).
In horizontal concatenation, the row labels are aligned but the column labels may be repeated:
df4 = pd.concat([df, df2], axis=1)
print(df4)
# a b b c
# 0 1 1.0 1 1.0
# 1 2 1.5 2 1.5
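concat() defaults to join='outer' (the union of labels on the aligned axis); join='inner' keeps only the labels common to both frames. A sketch with the same df and df2, where only column 'b' is shared:
df5 = pd.concat([df, df2], join='inner')
print(df5)
# b
# 0 1.0
# 1 1.5
# 0 1.0
# 1 2.0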
DataFrame append() was the method equivalent of pd.concat(), called on a DataFrame; it was deprecated in pandas 1.4 and removed in pandas 2.0, so current code should use pd.concat():
df = pd.concat([df, df2])            # formerly: df = df.append(df2)
df = pd.concat([df, df2], axis=1)    # append() had no axis= parameter; concat covers both directions
To append a blank row (formerly df.append(pd.Series(), ignore_index=True)), concatenate an empty one-row DataFrame with ignore_index=True:
df = pd.DataFrame( {'a': [1, 2],
                    'b': [1.0, 1.5] } )
df = pd.concat([df, pd.DataFrame(index=[0])], ignore_index=True)
print(df)
#      a    b
# 0  1.0  1.0
# 1  2.0  1.5
# 2  NaN  NaN
merge() provides database-like joins.
Merge performs a relational database-like join on two dataframes. We can join on a particular field and the other fields will align accordingly.
companies = pd.read_excel('company_states.xlsx', sheet_name='Companies')
states = pd.read_excel('company_states.xlsx', sheet_name='States')
print(companies)
# Company State
# 0 Microsoft WA
# 1 Apple CA
# 2 IBM NY
# 3 PRTech PR
print(states)
# State Abbrev State Long
# 0 AZ Arizona
# 1 CA California
# 2 CO Colorado
# 3 NY New York
# 4 WA Washington
cs = pd.merge(companies, states,
              left_on='State', right_on='State Abbrev')
print(cs)
# Company State State Abbrev State Long
# 0 Microsoft WA WA Washington
# 1 Apple CA CA California
# 2 IBM NY NY New York
By default merge performs an inner join, so PRTech (whose 'PR' has no match in states) is dropped from the result.
When we merge, we can join on one or more columns (by default, merge joins on the columns common to both frames), or on the index (with left_index= / right_index=). The join types are similar to those in relational databases:

Merge method   SQL Join Name      Description
left           LEFT OUTER JOIN    use keys from left frame only
right          RIGHT OUTER JOIN   use keys from right frame only
outer          FULL OUTER JOIN    use union of keys from both frames
inner          INNER JOIN         use intersection of keys from both frames

how= describes the type of join
on= designates the column on which to join
If the join columns are named differently, we use left_on= and right_on=
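To join on a frame's index instead of a column, pass right_index=True (or left_index=True). A sketch using the frames above, with states re-indexed by its abbreviation column:
states_idx = states.set_index('State Abbrev')
cs = pd.merge(companies, states_idx, left_on='State', right_index=True)
print(cs)
# Company State State Long
# 0 Microsoft WA Washington
# 1 Apple CA California
# 2 IBM NY New York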
left join: include all keys from the 'left' dataframe. Every company is kept (PRTech appears with NaN in the right-side columns because 'PR' has no match), while states absent from 'companies' (AZ, CO) are excluded.
cs = pd.merge(companies, states, how='left',
              left_on='State', right_on='State Abbrev')
print(cs)
# Company State State Abbrev State Long
# 0 Microsoft WA WA Washington
# 1 Apple CA CA California
# 2 IBM NY NY New York
# 3 PRTech PR NaN NaN
(A right join is the mirror image: all keys from the 'right' dataframe are kept, as sketched below.)
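A minimal sketch of the right join with the same frames (row order follows the right frame and may vary):
cs = pd.merge(companies, states, how='right',
              left_on='State', right_on='State Abbrev')
print(cs)
# Company State State Abbrev State Long
# 0 NaN NaN AZ Arizona
# 1 Apple CA CA California
# 2 NaN NaN CO Colorado
# 3 IBM NY NY New York
# 4 Microsoft WA WA Washington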
outer join: include keys from both dataframes. Note that all states are included, and the missing data from 'companies' is shown as NaN:
cs = pd.merge(companies, states, how='outer',
              left_on='State', right_on='State Abbrev')
print(cs)
# Company State State Abbrev State Long
# 0 Microsoft WA WA Washington
# 1 Apple CA CA California
# 2 IBM NY NY New York
# 3 PRTech PR NaN NaN
# 4 NaN NaN AZ Arizona
# 5 NaN NaN CO Colorado
inner join: include only keys common to both dataframes. Note that PRTech (unmatched 'PR') and the unmatched states (AZ, CO) are all dropped:
cs = pd.merge(companies, states, how='inner',
              left_on='State', right_on='State Abbrev')
print(cs)
# Company State State Abbrev State Long
# 0 Microsoft WA WA Washington
# 1 Apple CA CA California
# 2 IBM NY NY New York