pandas: Transforming, Sorting and Cleaning

Vectorized Operations

Operations on columns are vectorized, meaning they are propagated (broadcast) across all the values of a column Series in a DataFrame.


import pandas as pd

df = pd.DataFrame( {'a': [1, 2, 3, 4],
                    'b': [1.0, 1.5, 2.0, 2.5],
                    'c': ['a', 'b', 'c', 'd'] }, index=['r1', 'r2', 'r3', 'r4'] )

print(df)

                  #     a    b  c
                  # r1  1  1.0  a
                  # r2  2  1.5  b
                  # r3  3  2.0  c
                  # r4  4  2.5  d


# 'single value':  assign the same value to all cells in a column Series
df['a'] = 0       # set all 'a' values to 0
print(df)

                  #     a    b  c
                  # r1  0  1.0  a
                  # r2  0  1.5  b
                  # r3  0  2.0  c
                  # r4  0  2.5  d


# 'calculation':  compute a new value for all cells in a column Series
df['b'] = df['b'] * 2    # double all column 'b' values

print(df)

                  #     a    b  c
                  # r1  0  2.0  a
                  # r2  0  3.0  b
                  # r3  0  4.0  c
                  # r4  0  5.0  d




Adding New Columns with Vectorized Values

We can also add a new column to the DataFrame based on a single value or a vectorized computation:

df = pd.DataFrame( {'a': [1, 2, 3, 4],
                    'b': [2.0, 3.0, 4.0, 5.0],
                    'c': ['a', 'b', 'c', 'd'] }, index=['r1', 'r2', 'r3', 'r4'] )


df['d'] = 3.14           # new column, each field set to same value

print(df)

                  #     a    b  c  d
                  # r1  1  2.0  a  3.14
                  # r2  2  3.0  b  3.14
                  # r3  3  4.0  c  3.14
                  # r4  4  5.0  d  3.14


df['e'] = df['a'] + df['b']    # vectorized computation to new column

print(df)

                  #     a    b  c     d  e
                  # r1  1  2.0  a  3.14  3.0
                  # r2  2  3.0  b  3.14  5.0
                  # r3  3  4.0  c  3.14  7.0
                  # r4  4  5.0  d  3.14  9.0




Aggregate methods for DataFrame and Series

Methods .sum(), .cumsum(), .count(), .min(), .max(), .mean() and .median(), among others, provide summary operations.


import numpy as np
df = pd.DataFrame( {'a': [1, 2, 3, 4],
                    'b': [1.0, 1.5, np.nan, 2.5],
                    'c': ['a', 'b', 'b', 'a'] }, index=['r1', 'r2', 'r3', 'r4'] )

print(df.sum())

     # a      10
     # b       5
     # c    abba
     # dtype: object


print(df.cumsum())

     #      a    b     c
     # r1   1    1     a
     # r2   3  2.5    ab
     # r3   6  NaN   abb
     # r4  10    5  abba


print(df.count())

     # a    4
     # b    3
     # c    4
     # dtype: int64

Most of these methods work on a Series object as well:

print(df['a'].median())

     # 2.5
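
These aggregate methods skip NaN values by default (skipna=True), so a column with missing data still produces a result. A quick sketch using column 'b' above, which contains a NaN:

print(df['b'].sum())                # 5.0 -- the NaN is skipped
print(df['b'].mean())               # 1.6666666666666667 (mean of the 3 non-NaN values)
print(df['b'].sum(skipna=False))    # nan -- force the NaN to propagate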

To see a list of attributes for any object, use dir() with a DataFrame or Series object. This is best done in a Jupyter notebook:

dir(df['a'])     # attributes for Series

The list of attributes is long, but this kind of exploration can provide some useful surprises.
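
For example, a list comprehension can filter the dir() listing down to just the public names (this is a general exploration aid, not part of the pandas API):

public = [ name for name in dir(df['a']) if not name.startswith('_') ]
print(len(public))      # a long list -- the exact count varies by pandas version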





DataFrame groupby()

A groupby operation performs the same type of operation as a database GROUP BY: grouping rows of the table by the values in a particular column, you can then perform aggregate sums, counts or custom aggregations.


This simple hypothetical table shows company names, regions, revenue values and the type of revenue.


df = pd.DataFrame( { 'company': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta', 'Beta', 'Gamma', 'Gamma', 'Gamma'],
                     'region':  ['NE', 'NW', 'SW', 'NW', 'SW', 'NE', 'NE', 'SW', 'NW'],
                     'revenue': [10, 9, 2, 15, 8, 2, 16, 3, 9],
                     'revtype': ['retail', 'retail', 'wholesale', 'wholesale', 'wholesale',
                                 'retail', 'wholesale', 'retail', 'retail'] } )

print(df)

                  #   company region  revenue    revtype
                  # 0   Alpha     NE       10     retail
                  # 1   Alpha     NW        9     retail
                  # 2   Alpha     SW        2  wholesale
                  # 3    Beta     NW       15  wholesale
                  # 4    Beta     SW        8  wholesale
                  # 5    Beta     NE        2     retail
                  # 6   Gamma     NE       16  wholesale
                  # 7   Gamma     SW        3     retail
                  # 8   Gamma     NW        9     retail




groupby() built-in Aggregate Functions

The "summary functions" like sum() count()


Aggregations are provided by the DataFrame groupby() method, which returns a special groupby object. If we'd like to see revenue aggregated by region, we can simply call an aggregation method on this object:


# revenue sum by region
rsbyr = df.groupby('region').sum(numeric_only=True)   # call sum() on the groupby object
print(rsbyr)

                  #         revenue
                  # region
                  # NE           28
                  # NW           33
                  # SW           13


# revenue average by region
rabyr = df.groupby('region').mean(numeric_only=True)
print(rabyr)

                  #           revenue
                  # region
                  # NE       9.333333
                  # NW      11.000000
                  # SW       4.333333

The result is a DataFrame with 'region' as the index and 'revenue' as the sole column. Note that numeric_only=True excludes the non-numeric columns from the sum or mean (older pandas versions silently dropped these "nuisance" columns; recent versions require us to be explicit). If we ask for a count, pandas counts every column (which will be the same for each). So if we'd like the analysis to be limited to one or more columns, we can simply slice the DataFrame first:


# count of all columns by region
print(df.groupby('region').count())

                  #         company  revenue  revtype
                  # region
                  # NE            3        3        3
                  # NW            3        3        3
                  # SW            3        3        3


# count of companies by region
dfcr = df[['company', 'region']]       # dataframe slice:  only 'company' and 'region'
print(dfcr.groupby('region').count())

                  #         company
                  # region
                  # NE            3
                  # NW            3
                  # SW            3
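
Alternatively, we can select a single column from the groupby object itself; aggregating it returns a Series rather than a DataFrame:

# revenue sum by region, as a Series
print(df.groupby('region')['revenue'].sum())

                  # region
                  # NE    28
                  # NW    33
                  # SW    13
                  # Name: revenue, dtype: int64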

Multi-column aggregation

To aggregate by the values in two combined columns, simply pass a list of the columns by which to group -- the result is called a "multi-column aggregation":


print(df.groupby(['region', 'revtype']).sum(numeric_only=True))

                  #                   revenue
                  # region revtype
                  # NE     retail          12
                  #        wholesale       16
                  # NW     retail          18
                  #        wholesale       15
                  # SW     retail           3
                  #        wholesale       10
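
To apply several aggregations at once -- or a custom function -- use the groupby object's .agg() method; a brief sketch:

# sum, mean and max of revenue within each region
print(df.groupby('region')['revenue'].agg(['sum', 'mean', 'max']))

                  #         sum       mean  max
                  # region
                  # NE       28   9.333333   16
                  # NW       33  11.000000   15
                  # SW       13   4.333333    8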

List of selected built-in groupby functions


               count()
               mean()
               sum()
               min()
               max()
               describe() (produces several summary columns, including count, mean, std, min and max -- see the example below)
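
For example, describe() applied to the revenue column produces a full summary for each region:

print(df.groupby('region')['revenue'].describe())

                  #         count       mean       std  min  25%   50%   75%   max
                  # region
                  # NE        3.0   9.333333  7.023769  2.0  6.0  10.0  13.0  16.0
                  # NW        3.0  11.000000  3.464102  9.0  9.0   9.0  12.0  15.0
                  # SW        3.0   4.333333  3.214550  2.0  2.5   3.0   5.5   8.0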




DataFrame Sorting and Transposing

Reorder and rotate a DataFrame


import random
rdf = pd.DataFrame({'a': [ random.randint(1,5) for i in range(5)],
                    'b': [ random.randint(1,5) for i in range(5)],
                    'c': [ random.randint(1,5) for i in range(5)]})
print(rdf)

    #    a  b  c
    # 0  2  1  4
    # 1  5  3  3
    # 2  1  2  4
    # 3  5  2  4
    # 4  2  4  4

# sorting by a column
rdf = rdf.sort_values('a')
print(rdf)

    #    a  b  c
    # 2  1  2  4
    # 0  2  1  4
    # 4  2  4  4
    # 1  5  3  3
    # 3  5  2  4

# sorting columns by the values in a row (here, the row labeled 3)
idf = rdf.sort_values(3, axis=1)
print(idf)

    #    b  c  a
    # 2  2  4  1
    # 0  1  4  2
    # 4  4  4  2
    # 1  3  3  5
    # 3  2  4  5

# sorting values by two columns (first by 'c', then by 'b')
rdf = rdf.sort_values(['c', 'b'])
print(rdf)

    #    a  b  c
    # 1  5  3  3
    # 0  2  1  4
    # 2  1  2  4
    # 3  5  2  4
    # 4  2  4  4

# sorting by index
rdf = rdf.sort_index()
print(rdf)

    #    a  b  c
    # 0  2  1  4
    # 1  5  3  3
    # 2  1  2  4
    # 3  5  2  4
    # 4  2  4  4

# sorting options:  ascending=False, axis=1
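
For example, ascending=False reverses the sort:

print(rdf.sort_values('a', ascending=False))

    #    a  b  c
    # 1  5  3  3
    # 3  5  2  4
    # 0  2  1  4
    # 4  2  4  4
    # 2  1  2  4

# ascending can also be a list, one direction per column:
# rdf.sort_values(['c', 'b'], ascending=[True, False])   # 'c' ascending, 'b' descending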

Transposing


Transposing simply means inverting the x and y axes -- in a sense, flipping the values diagonally:

rdft = rdf.T
print(rdft)

    #    0  1  2  3  4
    # a  2  5  1  5  2
    # b  1  3  2  2  4
    # c  4  3  4  4  4




Working with Missing Data (NaN)

"Not a Number" is numpy's None value.


If pandas can't supply a value (because indexes are misaligned or for other reasons), it inserts the special value NaN ("not a number") in its place. This value belongs to the numpy module and is accessed as np.nan.


import numpy as np

df = pd.DataFrame({ 'c1': [6, 6, np.nan],
                    'c2': [np.nan, 1, 3],
                    'c3': [2, 2, 2]  })

print(df)
                  #     c1   c2  c3
                  # 0  6.0  NaN   2
                  # 1  6.0  1.0   2
                  # 2  NaN  3.0   2

Note that we are specifying the NaN value explicitly with np.nan, although in most cases the value is generated by "holes" in mismatched data.
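
For example, adding two Series with misaligned indexes produces NaN wherever a label appears in only one of them:

s1 = pd.Series([1, 2], index=['r1', 'r2'])
s2 = pd.Series([10, 20], index=['r2', 'r3'])

print(s1 + s2)
                  # r1     NaN
                  # r2    12.0
                  # r3     NaN
                  # dtype: float64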


We can fill missing data with fillna():

df2 = df.fillna(0)
print(df2)
                  #     c1   c2  c3
                  # 0  6.0  0.0   2
                  # 1  6.0  1.0   2
                  # 2  0.0  3.0   2
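
fillna() also accepts a dict mapping column names to fill values, so each column can be filled differently -- here, filling 'c2' with its own mean:

df5 = df.fillna({'c1': 0, 'c2': df['c2'].mean()})
print(df5)
                  #     c1   c2  c3
                  # 0  6.0  2.0   2
                  # 1  6.0  1.0   2
                  # 2  0.0  3.0   2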

Or we can choose to drop rows or columns that have any NaN values with dropna():

df3 = df.dropna()

print(df3)
                  #     c1   c2  c3
                  # 1  6.0  1.0   2

# axis=1:  drop columns
df4 = df.dropna(axis=1)

print(df4)

                  #    c3
                  # 0   2
                  # 1   2
                  # 2   2

Testing for NaN

We may well be interested in whether a column or row has missing data. .isnull() provides a True/False mapping.


print(df)
                  #     c1   c2  c3
                  # 0  6.0  NaN   2
                  # 1  6.0  1.0   2
                  # 2  NaN  3.0   2

df['c1'].isnull().any()  # True
df['c3'].isnull().any()  # False


df['c1'].isnull().all()  # False
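
Because True counts as 1, summing the .isnull() mapping counts the missing values in each column:

print(df.isnull().sum())
                  # c1    1
                  # c2    1
                  # c3    0
                  # dtype: int64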



