Advanced Python
In-Class Exercises, Session 9

Notes if you are using Jupyter Notebook: to call exit() from a notebook, use sys.exit() (requires import sys). If a strange error occurs, it may be because Jupyter retains variables from all previously executed cells. To reset the notebook, click 'Restart Kernel' (the circular arrow) -- this clears all variables but does not undo changes saved to files. Also note that the data files are located in the parent directory, so you must prepend ../ to each filename to find it.

Ex. 9.1 Include some useful settings.
import pandas as pd
import numpy as np

import warnings

# enable multiple outputs from jupyter cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# set default number of DataFrame rows printed to 8
pd.set_option('display.max_rows', 8)

# suppress excel reader warning:  'extension is unsupported'
warnings.filterwarnings('ignore', category=UserWarning,
                                  module='openpyxl')

# set the "Copy-on-Write" functionality to avoid
# "copy of a slice" warning and behavior
pd.options.mode.copy_on_write = True
 

SERIES AND DATAFRAME DTYPES

Ex. 9.2 Review: check the .dtypes of a DataFrame and .dtype of a Series.
import pandas as pd

dol = {  'a': [1, 2, 3],
         'b': [2.9, 3.5, 4.9],
         'c': ['yourstr', 'mystr', 'theirstr']  }

df = pd.DataFrame(dol, index=['w', 'x', 'y'])
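A possible solution (the dtypes in the comments are what pandas typically reports for this data):

df.dtypes             # a: int64, b: float64, c: object

df['a'].dtype         # int64
df['c'].dtype         # object (the default dtype for Python strings)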
 
Ex. 9.3 Use Series .astype(). Change the 'a' column to np.float64, and 'b' column to str; check dtypes before and after the operation.
import pandas as pd
import numpy as np

dol = {  'a': [1, 2, 3],
         'b': [2.9, 3.5, 4.9],
         'c': ['yourstr', 'mystr', 'theirstr']  }

df = pd.DataFrame(dol, index=['w', 'x', 'y'])
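A possible solution sketch:

print('before:')
print(df.dtypes)                      # a: int64, b: float64

df['a'] = df['a'].astype(np.float64)
df['b'] = df['b'].astype(str)

print('after:')
print(df.dtypes)                      # a: float64, b: object (str values)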
 
Ex. 9.4 Use pd.to_numeric(). Change the 'a' and 'b' columns to a numeric type, and show dtypes afterwards.
import pandas as pd

dol = {  'a': ['1', '2', '3'],
         'b': ['2.9', '3.5', '4.9'],
         'c': ['yourstr', 'mystr', 'theirstr']  }

df = pd.DataFrame(dol, index=['w', 'x', 'y'])
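A possible solution sketch -- pd.to_numeric() infers the "best" numeric type for each column:

df['a'] = pd.to_numeric(df['a'])
df['b'] = pd.to_numeric(df['b'])

df.dtypes             # a: int64, b: float64, c: object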
 

AGGREGATIONS WITH df.groupby() AND pd.pivot_table()

Please run this cell to view the sales-funnel spreadsheet:

import pandas as pd

df = pd.read_excel("../sales-funnel.xlsx")

df
Ex. 9.5 Single-column aggregation (.groupby() and .pivot_table()). Show the sum of 'SaleAmount' grouped by 'SalesRep'.
import pandas as pd

df = pd.read_excel("../sales-funnel.xlsx")

dfgb = df.groupby('SalesRep').SaleAmount.sum()

dfgb


dfpt = pd.pivot_table(df, index='SalesRep',
                          values='SaleAmount',
                          aggfunc='sum')

dfpt
 
Ex. 9.6 Multi-column aggregation (.groupby() and .pivot_table()). Show sums of 'SaleAmount' and 'Quantity' grouped by 'SalesRep' and by 'Product'.
import pandas as pd

df = pd.read_excel("../sales-funnel.xlsx")

dfgb = df.groupby(['SalesRep', 'Product'])[['SaleAmount', 'Quantity']].sum()

dfgb


dfpt = pd.pivot_table(df, index=['SalesRep', 'Product'],
                          values=['SaleAmount', 'Quantity'], aggfunc='sum')

dfpt
 
Ex. 9.7 Further aggregation by a secondary value (.groupby() and .pivot_table()). Perform the above pivot on just 'SalesRep', but use the columns= parameter to show the results broken down by 'Product'.
import pandas as pd
import numpy as np

df = pd.read_excel("../sales-funnel.xlsx")

dfpt = pd.pivot_table(df, index='SalesRep',
                          aggfunc='sum',
                          values=['SaleAmount', 'Quantity'],
                          columns='Product')

dfpt

dfgb = df.groupby(['SalesRep', 'Product'])[['SaleAmount', 'Quantity']].sum()

dfgb
 
Ex. 9.8 Use .stack() and .unstack() (with .groupby()). Looking at the above example, note that .groupby() produces results in "long" format, while .pivot_table() renders them in "wide" format. Use .unstack() and .stack() to move between the two.
import pandas as pd
import numpy as np

df = pd.read_excel("../sales-funnel.xlsx")

dfgb = df.groupby(['SalesRep', 'Product'])[['SaleAmount', 'Quantity']].sum()

dfgb

dfgbus = dfgb.unstack()

dfgbus

dfgbs = dfgbus.stack()

dfgbs
 
Ex. 9.9 Custom aggregation function (.groupby() and .pivot_table()). Group rows by 'SalesRep' and apply a function to each group that sums up the 'SaleAmount' column. (In other words, this replicates what .groupby().sum() does).
import pandas as pd

def price_sum(this_df):
    return this_df.SaleAmount.sum()

df = pd.read_excel("../sales-funnel.xlsx")

dfgb = df.groupby('SalesRep').apply(price_sum)

dfgb



def price_sum_series(series):
    return series.sum()

dfpt = pd.pivot_table(df, index='SalesRep',
                          values='SaleAmount',
                          aggfunc=price_sum_series)

dfpt
 
Ex. 9.10 Custom 'summary' function that refers to other columns (.groupby() only). Building on the previous solution, customize the function to sum 'SaleAmount', but reduce hardware ('CPU' and 'Monitor') values by 20% in the sum (because of the overhead cost of handling hardware).
import pandas as pd

def price_sum(this_df):
    # discount hardware rows by 20%, then sum the group's SaleAmount
    this_df.loc[ this_df.Product.isin(['CPU', 'Monitor']),
                                      'SaleAmount' ] = this_df.SaleAmount * .8
    return this_df.SaleAmount.sum()

df = pd.read_excel("../sales-funnel.xlsx")

dfgb = df.groupby('SalesRep').apply(price_sum)

dfgb

Since .pivot_table() only passes the Series of the column being aggregated, a custom aggregation cannot take into account the values in other columns. (In multi-column aggregations using .pivot_table(), the aggfunc= function is applied separately to each column).
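To illustrate, here is a minimal sketch (reusing the sales-funnel df from above): the aggfunc= callable receives one column's values for one group at a time, so it cannot see sibling columns.

dfpt = pd.pivot_table(df, index='SalesRep',
                          values=['SaleAmount', 'Quantity'],
                          aggfunc=lambda s: s.sum())   # s: a single column, one group

dfpt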

 
Ex. 9.11 Custom 'grouping' function (.groupby() only). Pass a function to groupby() that identifies 2 groups using the 'Product' column: if the value is 'CPU' or 'Monitor', return 'hardware'; otherwise, return 'software'. Count the number of rows in each group with .count().
import pandas as pd

def grouper(row_idx):
    row = df.loc[row_idx]
    if row['Product'] in ['Software', 'Maintenance']:
        return 'software'
    else:
        return 'hardware'

df = pd.read_excel("../sales-funnel.xlsx")

df.groupby(grouper).Account.count()

The function receives a row index, which you can use with .loc[] to find the row being grouped. Have the function return 'software' if the 'Product' value is 'Software' or 'Maintenance', and 'hardware' if it is 'CPU' or 'Monitor'. .pivot_table() does not offer a way to customize groupings; however, you can precompute a grouping key for each row in the DataFrame and then pivot on that.
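A minimal sketch of that precomputed-key approach (the 'Category' column name is just illustrative):

import numpy as np

df['Category'] = np.where(df['Product'].isin(['CPU', 'Monitor']),
                          'hardware', 'software')

pd.pivot_table(df, index='Category', values='Account', aggfunc='count')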

 

WORKING WITH MISSING DATA

Ex. 9.12 Compare None to np.nan.
import numpy as np
import pandas as pd

pynone     = None            # means "empty"
not_a_num  = np.nan          # means "indeterminate number"

print(f'None:   {type(pynone)}: {bool(pynone)}')                # <class 'NoneType'>:  False
print(f'np.nan: {type(not_a_num)}:    {bool(not_a_num)}')       # <class 'float'>: True

np.nan is a float, principally because it needs to work in math calculations (for example, summing up a column that might have empty values). np.nan is True in a boolean context because it is a non-zero float -- so, following Python's rules of "truthiness", it must be True. To identify it, we'll use pandas' dedicated methods such as .isna().
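A quick demonstration of why equality tests can't find np.nan (sketch):

import numpy as np
import pandas as pd

print(np.nan == np.nan)     # False -- nan never compares equal, even to itself
print(pd.isna(np.nan))      # True
print(pd.isna(None))        # True -- pandas treats None as missing as well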

 
Ex. 9.13 First, review the merged DataFrame and note the missing values. Then check to see if any empty values can be found in the 'course' column.
import pandas as pd

sgrades = pd.read_csv('../student_db_grades.csv', sep=':')
snames = pd.read_csv('../student_db_names.csv')

sj = pd.merge(sgrades, snames, on='id', how='outer')

sj
print()

# see boolean column of whether missing or not
print('sj.course.isna(): ')
sj.course.isna()
print()

# see whether there are any missing values
print('sj.course.isna().any(): ')
sj.course.isna().any()
print()

# count the number of missing values
print('sj.course.isna().sum(): ')
sj.course.isna().sum()
print()

# select rows with missing data
print('sj[ sj.course.isna() ]')
sj[ sj.course.isna() ]
print()
 
Ex. 9.14 Summarize the number of missing values in each column. Use .isna().sum() to count missing values per column and df.info() to inspect per-column non-null counts.
sj.isna().sum()
sj.info()
 
Ex. 9.15 Drop rows or columns that contain missing values. Use .dropna() with axis and thresh parameters to control behavior.
print('dataframe "sj":')
sj
print()

print('sj.dropna():')
sj.dropna()            # drop rows with any NaN
print()

print('sj.dropna(axis=1)')
sj.dropna(axis=1)      # drop columns with any NaN
print()

print('sj.dropna(thresh=5):')
sj.dropna(thresh=5)    # keep rows with at least 5 non-NaN values
 
Ex. 9.16 Fill missing values using a constant, a computed value, or a method. Use .fillna() with scalar values, aggregates, or directional methods like 'ffill'. (This exercise uses the small sample DataFrame below.)
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': ['x', 'y', 'z', None]
})

print('df:')
df
print()

print('df.fillna(0):')
df.fillna(0)
print()

print("df.ffill()")
df.ffill()
print()

print("df['A'].fillna(  df['A'].mean()  )")
df['A'].fillna(  df['A'].mean()  )
print()
 
Ex. 9.17 Replace missing values using a general-purpose replacement method. Use .replace() to substitute np.nan or None with other values.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': ['x', 'y', 'z', None]
})
print('df:')
df

print('df.replace(np.nan, 999):')
df.replace(np.nan, 999)
 
Ex. 9.18 Interpolate missing values in a numeric column. Use .interpolate() to linearly estimate missing values.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': ['x', 'y', 'z', None]
})

df['A'].interpolate()
 
Ex. 9.19 Filter rows based on custom logic for missing data. Use Boolean masks with .isna() and .notna() to target specific missing value conditions.
print('df:')
df
print()

print("df[  df['A'].notna() & df['B'].isna()  ]")
df[  df['A'].notna() & df['B'].isna()  ]
print()
 
Ex. 9.20 Fill missing values within groups using the group mean. Use .groupby() combined with .transform() to fill missing data contextually.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': ['x', 'y', 'z', None]
})

df_grouped = df.copy()
df_grouped['Group'] = ['G1', 'G1', 'G2', 'G2']
df_grouped['A'] = df_grouped.groupby('Group')['A'].transform(lambda x: x.fillna(x.mean()))
df_grouped
 

MULTIINDEX

Please begin by noting the multi-level index that is produced by the below multi-column aggregation.

import pandas as pd
import numpy as np

df = pd.read_excel("../sales-funnel.xlsx")

table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
                           aggfunc='sum',
                           values=['SaleAmount', 'Quantity'])

table
Ex. 9.21 MultiIndex: slicing at top level. Slice the multi-index for Manager 'Debra H.'.
import pandas as pd
import numpy as np

df = pd.read_excel("../sales-funnel.xlsx")

table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
                           aggfunc='sum',
                           values=['SaleAmount', 'Quantity'])

table.loc['Debra H.']
 
Ex. 9.22 MultiIndex: slicing at 2nd level. Use a 2-item tuple (first-level label, second-level label) to slice for just 'Debra H.' at the first level and 'Craig B.' at the 2nd level.
import pandas as pd
import numpy as np

df = pd.read_excel("../sales-funnel.xlsx")

table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
                           aggfunc='sum',
                           values=['SaleAmount', 'Quantity'])

tl = table.loc[('Debra H.', 'Craig B.')]

tl
 
Ex. 9.23 MultiIndex: "cross section" slicing. Use the .xs() method with level=1 to show rows just for 'Craig B.'.
import pandas as pd
import numpy as np

df = pd.read_excel("../sales-funnel.xlsx")

table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
                           aggfunc='sum',
                           values=['SaleAmount', 'Quantity'])

txs = table.xs('Craig B.', level=1)

txs
 
Ex. 9.24 MultiIndex: use .query('Manager == "Debra H."') to select rows.
import pandas as pd
import numpy as np

df = pd.read_excel("../sales-funnel.xlsx")

table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
                           aggfunc='sum',
                           values=['SaleAmount', 'Quantity'])

table.query("Manager == 'Debra H.'")

# table.query("Manager == 'Debra H.' and SalesRep == 'Craig B.'")

# table.query("SalesRep in ['Craig B.', 'Daniel H.']")
 
Ex. 9.25 MultiIndex: First review the below DataFrame; then use .query('Status in ["pending", "won"]') to select rows.
import pandas as pd

df = pd.read_excel("../sales-funnel.xlsx")

dfq = df.query('Status in ["pending", "won"]')

dfq
 

TIMESERIES

Ex. 9.26 Create a pd.Timestamp object by passing any string that represents a date. Print the object's type and value. Next, try adding a time value to the date and note the result.
import pandas as pd

ts = pd.Timestamp('January 3rd, 2020')

print(type(ts))
print(ts)
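The follow-up step might look like this (a sketch; pandas accepts many date/time string formats):

ts2 = pd.Timestamp('January 3rd, 2020 11:59 PM')

print(ts2)                  # 2020-01-03 23:59:00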
 
Ex. 9.27 Create a pd.Timedelta interval object for one day (use the days=1 argument). Add it to the Timestamp object to produce a new Timestamp object with the advanced date. Print the object to see the new date. Next, try adding the minutes= parameter to see the date and time advance.
import pandas as pd

ts = pd.Timestamp('3/3/2020')

td = pd.Timedelta(days=1)

ts2 = ts + td

print(ts2)
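The minutes= follow-up might look like this (sketch):

td2 = pd.Timedelta(days=1, minutes=30)

print(ts + td2)             # 2020-03-04 00:30:00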
 
Ex. 9.28 Create a pd.Timedelta interval object by subtracting one date from another. Print the object's value and its type.
import pandas as pd

ts1 = pd.Timestamp('3/3/2020')
ts2 = pd.Timestamp('3/5/2020')

td = ts1 - ts2

print(type(td))

print(td)
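Note that subtracting a later date from an earlier one yields a negative interval ('-2 days' here); abs() recovers the magnitude (sketch):

print(abs(td))              # 2 days 00:00:00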
 
Ex. 9.29 Generate a date range. Use the pd.date_range() function with a start date, end date and freq= parameter indicating the frequency of dates ('D' for day, 'M' for end of month, 'MS' for start of month, 'B' for business day, etc. -- please see "NB_date_range_and_period_frequency_key.ipynb" for more). Print the object to see its contents and type.
import pandas as pd

dr = pd.date_range('2020-02-03', '2021-02-03', freq='MS')

print(dr)
 
Ex. 9.30 Converting a DataFrame column/index to datetime. View the below DataFrame and note the date column EST. Use pd.to_datetime() or Series.astype() to convert the 'EST' series to Timestamps, then use .set_index() with the column name and drop=True to set the column as the index. Check the dtype of the index.
import pandas as pd

df = pd.read_csv('../weather_newyork.csv')

df.EST = df.EST.astype('datetime64[ns]')

df = df.set_index('EST', drop=True)

print('df: ')
df

print('df.index.dtype')
print(df.index.dtype)

If your dtype shows as '<M8[ns]', that is simply numpy's shorthand for 'datetime64[ns]' -- both indicate a 'timezone-naive' datetime (one that carries no timezone information). A timezone-aware column instead shows its zone in the dtype, for example 'datetime64[ns, US/Eastern]'. Since timezone-aware timestamps know their timezone, they can be compared and used in calculations with other timezone-aware timestamps, even across timezones.
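For reference, a sketch of making the index timezone-aware (assuming the 'EST' dates are New York local times):

df_aware = df.tz_localize('US/Eastern')

print(df_aware.index.dtype)     # datetime64[ns, US/Eastern]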

 
Ex. 9.31 Select rows based on partial date. Given the below DataFrame, select rows using .loc[] for the following:

  • single date (put the date in any format)
  • single year-month (for example, '2016-03')
  • single year
  • date range as a slice (for example, '2016-03-03': '2016-05-02')

import pandas as pd

df = pd.read_csv('../weather_newyork_2016_2018.csv')

df.EST = pd.to_datetime(df.EST)

df = df.set_index('EST', drop=True)


df.loc['2017-03-01']
print()

df.loc['2016-05']
print()

df.loc['2016']
print()

df.loc['2017-12-28': '2018-01-03']

# some dates for slicing
# '2017-03-01'
# '2016-05'
# '2016'
# '2017-12-28': '2018-01-03'
 
Ex. 9.32 Generate daily stock prices. The following code generates synthetic daily stock prices for 120 days (add an integer argument to .head() to see more rows).
import pandas as pd
import numpy as np

np.random.seed(0)

dates = pd.date_range(start='2023-01-01', periods=120, freq='D')

prices = 100 + np.cumsum(np.random.randn(120))  # synthetic price series

df = pd.DataFrame({'date': dates, 'close': prices})

df.head()
 
Ex. 9.33 Move the 'date' column to become a DateTimeIndex. Convert the 'date' column to datetime type and then set it as the DataFrame index.
df['date'] = pd.to_datetime(df['date'])  # ensure datetime type
df.set_index('date', inplace=True)

df.head()
 
Ex. 9.34 Resample to show monthly returns. Use .resample('M') with .mean() to get the average price for each month, then add .pct_change() to calculate percent change.
monthly_returns = df['close'].resample('M').mean().pct_change()

monthly_returns
 
Ex. 9.35 Group the data by month or quarter. Add a column representing the period, then group by that column and compute the mean.
df['month'] = df.index.to_period('M')

monthly_grouped = df.groupby('month')['close'].mean()

monthly_grouped


# or, by quarter:
# df['quarter'] = df.index.to_period('Q')
# quarterly_grouped = df.groupby('quarter')['close'].mean()

# quarterly_grouped
 
Ex. 9.36 Compute a 30-day moving average. Use .rolling(window=30).mean() to calculate the 30-day moving average of closing prices.
df['30d_ma'] = df['close'].rolling(window=30).mean()

df[['close', '30d_ma']].tail(10)
 
Ex. 9.37 Compare, align or join with another index using date-based alignment. Create a second time series and join it to the existing DataFrame on the date index.
benchmark = pd.DataFrame({
    'date': pd.date_range(start='2023-01-01', periods=120, freq='D'),
    'benchmark_close': 102 + np.cumsum(np.random.randn(120))
})
benchmark.set_index('date', inplace=True)

joined = df.join(benchmark, how='inner')

joined.head()
 
Ex. 9.38 Use .shift(). Shift the close price by one day to compare with the current price and calculate daily return.
df['prev_close'] = df['close'].shift(1)
df['daily_return'] = (df['close'] - df['prev_close']) / df['prev_close']

df[['close', 'prev_close', 'daily_return']].head(10)
 

REVIEW OF LAMBDA FUNCTIONS

Ex. 9.39 Convert a function to lambda. The following function takes one argument and returns the value doubled. Convert to lambda and use in the map() function.
# def doubleit(arg):
#     return arg * 2

seq = [1, 2, 3, 4]

seq2 = map(lambda arg:  arg * 2, seq)

print(list(seq2))           # [2, 4, 6, 8]
 
Ex. 9.40 Convert another function to lambda. The following function takes a dict key and returns the value. Convert to a lambda for use in the sorted() function.
d = {'a': 10, 'b': 2, 'c': 5}

# def byvalue(arg):
#     return d[arg]

skeys = sorted(d, key=lambda dict_key:  d[dict_key])

print(skeys)        # ['b', 'c', 'a'] (in order of value, low-to-high)
 

DATAFRAME .apply()

Ex. 9.41 Apply a function to each item in a Series column. Create new column 'd' that is the value of 'b' * 100, by using .apply() with the centify function (defined below). 'd' column values should be 290.0, 350.0, 490.0. When correct, replace the function with a lambda.
import pandas as pd

# def centify(arg):
#     return arg * 100

dol = {  'a': [1, 2, 3],
         'b': [2.9, 3.5, 4.9],
         'c': [0.1, 0.2, 0.3]  }

df = pd.DataFrame(dol, index=['w', 'x', 'y'])

df['d'] = df.b.apply(lambda arg:  arg * 100)

df
 
Ex. 9.42 Apply a function to each item in a Series row. Create new row 'xc' that is the value of row 'x' * 100, by using .apply() with the centify function (defined below). 'xc' row values should be 200, 35000.0, 20.0. When correct, replace the function with a lambda.
import pandas as pd

# def centify(arg):
#     return arg * 100

dol = {  'a': [1, 2, 3],
         'b': [290.0, 350.0, 490.0],
         'c': [0.1, 0.2, 0.3]  }

df = pd.DataFrame(dol, index=['w', 'x', 'y'])

df.loc['xc'] = df.loc['x'].apply(lambda arg:  arg * 100)

df
 
Ex. 9.43 Apply a function to each Series column in a DataFrame. Call .apply() on the DataFrame, which will pass each column Series to the function in turn. Inside the function, sum up the column values and create a new 'total' row with the sum of values.
import pandas as pd

def sum_col(series_column):
    # append a 'total' entry holding the sum of the column's values
    series_column.loc['total'] = series_column.sum()
    return series_column

dol = {  'a': [1, 2, 3],
         'b': [290.0, 350.0, 490.0],
         'c': [0.1, 0.2, 0.3]  }

df = pd.DataFrame(dol, index=['w', 'x', 'y'])

df = df.apply(sum_col)

df
 
Ex. 9.44 Apply a function to each Series row in a DataFrame. Call .apply() on the DataFrame with axis=1, which will pass each row Series to the function in turn. Sum up the row values and append the sum as a new 'd' element in each row, producing a new 'd' column.
import pandas as pd

def sum_row(series_row):
    # append a 'd' entry holding the sum of the row's values
    series_row['d'] = series_row.sum()
    return series_row

dol = {  'a': [1, 2, 3],
         'b': [290.0, 350.0, 490.0],
         'c': [0.1, 0.2, 0.3]  }

df = pd.DataFrame(dol, index=['w', 'x', 'y'])

df = df.apply(sum_row, axis=1)

df
 