Operations on columns are vectorized, meaning they are propagated (broadcast) across all values of a column Series in a DataFrame.
import pandas as pd
df = pd.DataFrame( {'a': [1, 2, 3, 4],
'b': [1.0, 1.5, 2.0, 2.5],
'c': ['a', 'b', 'c', 'd'] }, index=['r1', 'r2', 'r3', 'r4'] )
print(df)
# a b c
# r1 1 1.0 a
# r2 2 1.5 b
# r3 3 2.0 c
# r4 4 2.5 d
# 'single value': assign the same value to all cells in a column Series
df['a'] = 0 # set all 'a' values to 0
print(df)
# a b c
# r1 0 1.0 a
# r2 0 1.5 b
# r3 0 2.0 c
# r4 0 2.5 d
# 'calculation': compute a new value for all cells in a column Series
df['b'] = df['b'] * 2 # double all column 'b' values
print(df)
# a b c
# r1 0 2.0 a
# r2 0 3.0 b
# r3 0 4.0 c
# r4 0 5.0 d
We can also add a new column to the DataFrame based on values or computations:
df = pd.DataFrame( {'a': [1, 2, 3, 4],
'b': [2.0, 3.0, 4.0, 5.0],
'c': ['a', 'b', 'c', 'd'] }, index=['r1', 'r2', 'r3', 'r4'] )
df['d'] = 3.14 # new column, each field set to same value
print(df)
# a b c d
# r1 1 2.0 a 3.14
# r2 2 3.0 b 3.14
# r3 3 4.0 c 3.14
# r4 4 5.0 d 3.14
df['e'] = df['a'] + df['b'] # vectorized computation to new column
print(df)
# a b c d e
# r1 1 2.0 a 3.14 3.0
# r2 2 3.0 b 3.14 5.0
# r3 3 4.0 c 3.14 7.0
# r4 4 5.0 d 3.14 9.0
Methods .sum(), .cumsum(), .count(), .min(), .max(), .mean(), .median(), et al. provide summary operations.
import numpy as np
df = pd.DataFrame( {'a': [1, 2, 3, 4],
'b': [1.0, 1.5, np.nan, 2.5],
'c': ['a', 'b', 'b', 'a'] }, index=['r1', 'r2', 'r3', 'r4'] )
print(df.sum())
# a 10
# b 5
# c abba
# dtype: object
print(df.cumsum())
# a b c
# r1 1 1 a
# r2 3 2.5 ab
# r3 6 NaN abb
# r4 10 5 abba
print(df.count())
# a 4
# b 3
# c 4
# dtype: int64
Most of these methods work on a Series object as well:
print(df['a'].median())
# 2.5
To see a list of attributes for any object, pass a DataFrame or Series object to dir(). This is best done in a Jupyter notebook:
dir(df['a']) # attributes for Series
The list of attributes is long, but this kind of exploration can provide some useful surprises.
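The full list also includes many "dunder" (double-underscore) names; a rough sketch of one way to trim it down (the filtering is just an illustration, not part of dir() itself):
attrs = [name for name in dir(df['a']) if not name.startswith('_')]
print(len(attrs))     # number of public attributes and methods
print(attrs[:10])     # a first taste of what's available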
A groupby operation performs the same type of operation as the database GROUP BY. By grouping rows of the table by the value in a particular column, you can perform aggregate sums, counts or custom aggregations.
This simple hypothetical table shows client names, regions, revenue values and type of revenue.
df = pd.DataFrame( { 'company': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta', 'Beta', 'Gamma', 'Gamma', 'Gamma'],
'region': ['NE', 'NW', 'SW', 'NW', 'SW', 'NE', 'NE', 'SW', 'NW'],
'revenue': [10, 9, 2, 15, 8, 2, 16, 3, 9],
'revtype': ['retail', 'retail', 'wholesale', 'wholesale', 'wholesale',
'retail', 'wholesale', 'retail', 'retail'] } )
print(df)
# company region revenue revtype
# 0 Alpha NE 10 retail
# 1 Alpha NW 9 retail
# 2 Alpha SW 2 wholesale
# 3 Beta NW 15 wholesale
# 4 Beta SW 8 wholesale
# 5 Beta NE 2 retail
# 6 Gamma NE 16 wholesale
# 7 Gamma SW 3 retail
# 8 Gamma NW 9 retail
The "summary functions" like sum() count()
Aggregations are provided by the DataFrame groupby() method, which returns a special groupby object. If we'd like to see revenue aggregated by region, we can simply select the column to aggregate and call an aggregation function on this object:
# revenue sum by region
rsbyr = df.groupby('region').sum() # call sum() on the groupby object
print(rsbyr)
# revenue
# region
# NE 28
# NW 33
# SW 13
# revenue average by region
rabyr = df.groupby('region').mean()
print(rabyr)
# revenue
# region
# NE 9.333333
# NW 11.000000
# SW 4.333333
The result is a DataFrame with 'region' as the index and 'revenue' as the sole column. Note that although we didn't specify the revenue column, pandas noticed that the other columns were not numbers and excluded them from the sum and mean (newer pandas versions may instead require selecting the numeric column first or passing numeric_only=True). If we ask for a count, pandas counts each column (which will be the same for each). So if we'd like the analysis to be limited to one or more columns, we can simply slice the DataFrame first:
# count of all columns by region
print(df.groupby('region').count())
# company revenue revtype
# region
# NE 3 3 3
# NW 3 3 3
# SW 3 3 3
# count of companies by region
dfcr = df[['company', 'region']] # dataframe slice: only 'company' and 'region'
print(dfcr.groupby('region').count())
# company
# region
# NE 3
# NW 3
# SW 3
Multi-column aggregation
To aggregate by values in two combined columns, simply pass a list of columns by which to aggregate -- the result is called a "multi-column aggregation":
print(df.groupby(['region', 'revtype']).sum())
# revenue
# region revtype
# NE retail 12
# wholesale 16
# NW retail 18
# wholesale 15
# SW retail 3
# wholesale 10
List of selected built-in groupby functions
count() mean() sum() min() max() describe() (prints several statistics per group, including count, mean, std, min and max)
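For example, describe() can be applied to a single grouped column just like the other functions. A quick sketch using the revenue table above (the exact figures depend on the data):
# summary statistics for 'revenue' within each region:
# one row per region, with columns count, mean, std, min, 25%, 50%, 75%, max
print(df.groupby('region')['revenue'].describe())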
Reorder and rotate a DataFrame
import random
rdf = pd.DataFrame({'a': [ random.randint(1,5) for i in range(5)],
'b': [ random.randint(1,5) for i in range(5)],
'c': [ random.randint(1,5) for i in range(5)]})
print(rdf)
# a b c
# 0 2 1 4
# 1 5 3 3
# 2 1 2 4
# 3 5 2 4
# 4 2 4 4
# sorting by a column
rdf = rdf.sort_values('a')
print(rdf)
# a b c
# 2 1 2 4
# 0 2 1 4
# 4 2 4 4
# 1 5 3 3
# 3 5 2 4
# sorting columns by the values in a row (here, the row labeled 3)
idf = rdf.sort_values(3, axis=1)
print(idf)
# b c a
# 2 2 4 1
# 0 1 4 2
# 4 4 4 2
# 1 3 3 5
# 3 2 4 5
# sorting values by two columns (first by 'c', then by 'b')
rdf = rdf.sort_values(['c', 'b'])
print(rdf)
# a b c
# 1 5 3 3
# 0 2 1 4
# 2 1 2 4
# 3 5 2 4
# 4 2 4 4
# sorting by index
rdf = rdf.sort_index()
print(rdf)
# a b c
# 0 2 1 4
# 1 5 3 3
# 2 1 2 4
# 3 5 2 4
# 4 2 4 4
# sorting options: ascending=False, axis=1
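A couple of these options in action; a brief sketch (the exact row order depends on the random data above):
# ascending=False: largest 'a' values first
print(rdf.sort_values('a', ascending=False))
# sort_index with axis=1 reorders the columns themselves by label, here in reverse order (c, b, a)
print(rdf.sort_index(axis=1, ascending=False))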
Transposing swaps the rows and columns (the x and y axes) -- in a sense, flipping the values diagonally:
rdft = rdf.T
print(rdft)
# 0 1 2 3 4
# a 2 5 1 5 2
# b 1 3 2 2 4
# c 4 3 4 4 4
"Not a Number" is numpy's None value.
If pandas can't insert a value (because indexes are misaligned or for other reasons), it inserts a special value noted as NaN (not a number) in its place. This value belongs to the numpy module and is accessed through np.nan.
import numpy as np
df = pd.DataFrame({ 'c1': [6, 6, np.nan],
'c2': [np.nan, 1, 3],
'c3': [2, 2, 2] })
print(df)
# c1 c2 c3
# 0 6.0 NaN 2
# 1 6.0 1.0 2
# 2 NaN 3.0 2
Note that we are specifying the NaN value with np.nan, although in most cases the value is generated by "holes" in mismatched data.
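As a sketch of how mismatched data produces NaN on its own, adding two Series whose indexes only partially overlap leaves NaN wherever a label is missing from one side:
s1 = pd.Series([1, 2, 3], index=['r1', 'r2', 'r3'])
s2 = pd.Series([10, 20, 30], index=['r2', 'r3', 'r4'])
print(s1 + s2)
# r1     NaN      <- 'r1' not present in s2
# r2    12.0
# r3    23.0
# r4     NaN      <- 'r4' not present in s1
# dtype: float64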
We can fill missing data with fillna():
df2 = df.fillna(0)
print(df2)
# c1 c2 c3
# 0 6.0 0.0 2
# 1 6.0 1.0 2
# 2 0.0 3.0 2
Or we can choose to drop rows or columns that have any NaN values with dropna():
df3 = df.dropna()
print(df3)
# c1 c2 c3
# 1 6.0 1.0 2
# axis=1: drop columns
df4 = df.dropna(axis=1)
print(df4)
# c3
# 0 2
# 1 2
# 2 2
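dropna() also accepts a how= argument; how='all' drops a row only if every value in it is missing. In the df above no row is entirely NaN, so nothing is dropped:
df5 = df.dropna(how='all')
print(df5)
# c1 c2 c3
# 0 6.0 NaN 2
# 1 6.0 1.0 2
# 2 NaN 3.0 2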
Testing for NaN
We may well be interested in whether a column or row has missing data. .isnull() provides a True/False mapping.
print(df)
# c1 c2 c3
# 0 6.0 NaN 2
# 1 6.0 1.0 2
# 2 NaN 3.0 2
df['c1'].isnull().any() # True
df['c3'].isnull().any() # False
df['c1'].isnull().all() # False
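To see the full True/False mapping rather than a single summary value, call .isnull() on the whole DataFrame:
print(df.isnull())
# c1 c2 c3
# 0 False True False
# 1 False False False
# 2 True False False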