Advanced Python
In-Class Exercises, Session 9
Notes for Jupyter Notebook users: to call exit() from a notebook, please use sys.exit() (requires import sys). If a strange error occurs, it may be because Jupyter retains variables from all executed cells. To reset the notebook, click 'Restart Kernel' (the circular-arrow button) -- this will not undo any changes you have made to the notebook's cells. Also note that the data files are located in the parent directory, so you must prepend ../ to each filename to find it. |
|
Ex. 9.1 | Include some useful settings. |
import pandas as pd
import numpy as np
import warnings
# enable multiple outputs from jupyter cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# set default number of DataFrame rows printed to 8
pd.set_option('display.max_rows', 8)
# suppress excel reader warning: 'extension is unsupported'
warnings.filterwarnings('ignore', category=UserWarning,
module='openpyxl')
# set the "Copy-on-Write" functionality to avoid
# "copy of a slice" warning and behavior
pd.options.mode.copy_on_write = True
|
|
SERIES AND DATAFRAME DTYPES |
|
Ex. 9.2 | Review: check the .dtypes of a DataFrame and .dtype of a Series. |
import pandas as pd
dol = { 'a': [1, 2, 3],
'b': [2.9, 3.5, 4.9],
'c': ['yourstr', 'mystr', 'theirstr'] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
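One possible solution sketch -- inspect the DataFrame's dtypes and a single Series' dtype: |
df.dtypes          # a: int64, b: float64, c: object
df['a'].dtype      # dtype('int64')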
|
|
Ex. 9.3 | Use Series .astype(). Change the 'a' column to np.float64, and 'b' column to str; check dtypes before and after the operation. |
import pandas as pd
import numpy as np
dol = { 'a': [1, 2, 3],
'b': [2.9, 3.5, 4.9],
'c': ['yourstr', 'mystr', 'theirstr'] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
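A possible solution sketch using Series .astype(): |
print(df.dtypes)                        # a: int64, b: float64, c: object
df['a'] = df['a'].astype(np.float64)
df['b'] = df['b'].astype(str)
print(df.dtypes)                        # a: float64, b: object, c: object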
|
|
Ex. 9.4 | Use pd.to_numeric(). Change the 'a' and 'b' columns to a numeric type, and show dtypes afterwards. |
import pandas as pd
dol = { 'a': ['1', '2', '3'],
'b': ['2.9', '3.5', '4.9'],
'c': ['yourstr', 'mystr', 'theirstr'] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
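A possible solution sketch using pd.to_numeric(): |
df['a'] = pd.to_numeric(df['a'])        # integer strings become int64
df['b'] = pd.to_numeric(df['b'])        # decimal strings become float64
df.dtypes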
|
|
AGGREGATIONS WITH df.groupby() AND pd.pivot_table() |
|
Please run this cell to view the sales-funnel spreadsheet: |
|
import pandas as pd
df = pd.read_excel("../sales-funnel.xlsx")
df
|
|
Ex. 9.5 | Single-column aggregation (.groupby() and .pivot_table()). Show sum of 'SaleAmount' grouped by 'SalesRep'. |
import pandas as pd
df = pd.read_excel("../sales-funnel.xlsx")
dfgb = df.groupby('SalesRep').SaleAmount.sum()
dfgb
dfpt = pd.pivot_table(df, index='SalesRep',
values='SaleAmount',
aggfunc='sum')
dfpt
|
|
Ex. 9.6 | Multi-column aggregation (.groupby() and .pivot_table()). Show sums of 'SaleAmount' and 'Quantity' grouped by 'SalesRep' and by 'Product'. |
import pandas as pd
df = pd.read_excel("../sales-funnel.xlsx")
dfgb = df.groupby(['SalesRep', 'Product'])[['SaleAmount', 'Quantity']].sum()
dfgb
dfpt = pd.pivot_table(df, index=['SalesRep', 'Product'],
values=['SaleAmount', 'Quantity'], aggfunc='sum')
dfpt
|
|
Ex. 9.7 | Further agg by secondary value (.groupby() and .pivot_table()). Perform the above pivot on just 'SalesRep', but use the columns= parameter to show the results broken down by 'Product'. |
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
dfpt = pd.pivot_table(df, index='SalesRep',
aggfunc='sum',
values=['SaleAmount', 'Quantity'],
columns='Product')
dfpt
dfgb = df.groupby(['SalesRep', 'Product'])[['SaleAmount', 'Quantity']].sum()
dfgb
|
|
Ex. 9.8 | Use .stack() and .unstack() (with .groupby()). Looking at the above example, note that .groupby() produces results in "long" format, while .pivot_table() renders them in "wide" format. Use .unstack() and .stack() to move between the two. |
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
dfgb = df.groupby(['SalesRep', 'Product'])[['SaleAmount', 'Quantity']].sum()
dfgb
dfgbus = dfgb.unstack()
dfgbus
dfgbs = dfgbus.stack()    # back to "long" format
dfgbs
|
|
Ex. 9.9 | Custom aggregation function (.groupby() and .pivot_table()). Group rows by 'SalesRep' and apply a function to each group that sums up the 'SaleAmount' column. (In other words, this replicates what .groupby().sum() does). |
import pandas as pd
def price_sum(this_df):
return this_df.SaleAmount.sum()
df = pd.read_excel("../sales-funnel.xlsx")
dfgb = df.groupby('SalesRep').apply(price_sum)
dfgb
def price_sum_series(series):
return series.sum()
dfpt = pd.pivot_table(df, index='SalesRep',
values='SaleAmount',
aggfunc=price_sum_series)
dfpt
|
|
Ex. 9.10 | Custom 'summary' function that refers to other columns (.groupby() only). Building on the previous solution, customize the function to sum SaleAmount but reduce hardware values ('CPU' and 'Monitor') by 20% in the sum (because of the overhead cost of handling hardware). |
import pandas as pd
def price_sum(this_df):
this_df.loc[ this_df.Product.isin(['CPU', 'Monitor']),
'SaleAmount' ] = this_df.SaleAmount * .8
return this_df.SaleAmount.sum()
df = pd.read_excel("../sales-funnel.xlsx")
dfgb = df.groupby('SalesRep').apply(price_sum)
dfgb
|
|
Since .pivot_table() only passes the Series of the column being aggregated, a custom aggregation cannot take into account the values in other columns. (In multi-column aggregations using .pivot_table(), the aggfunc= function is applied separately to each column). |
|
Ex. 9.11 | Custom 'grouping' function (.groupby() only). Pass a function to groupby() that identifies 2 groups using the 'Product' column: if the value is 'CPU' or 'Monitor', return 'hardware'; otherwise, return 'software'. Count the number of rows in each group with .count(). |
import pandas as pd
def grouper(row_idx):
row = df.loc[row_idx]
if row['Product'] in ['Software', 'Maintenance']:
return 'software'
else:
return 'hardware'
df = pd.read_excel("../sales-funnel.xlsx")
df.groupby(grouper).Account.count()
|
|
The function receives a row index, which you can use with .loc[] to find the row being grouped. Have the function return 'software' if the 'Product' value is 'Software' or 'Maintenance', and 'hardware' if it is 'CPU' or 'Monitor'. .pivot_table() does not offer a way to customize groupings; however, you can precompute a grouping key for each row in the DataFrame and then pivot on that, as shown in the sketch below. |
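For example, a sketch of the "precompute a grouping key" approach (the 'ProductGroup' column name is just an illustration): |
import pandas as pd
df = pd.read_excel("../sales-funnel.xlsx")
# derive a hardware/software key from 'Product', then pivot on it
df['ProductGroup'] = df['Product'].map(
    lambda p: 'software' if p in ['Software', 'Maintenance'] else 'hardware')
pd.pivot_table(df, index='ProductGroup', values='Account', aggfunc='count')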
|
WORKING WITH MISSING DATA |
|
Ex. 9.12 | Compare None to np.nan. |
import numpy as np
import pandas as pd
pynone = None # means "empty"
not_a_num = np.nan # means "indeterminate number"
print(f'None: {type(pynone)}: {bool(pynone)}') # <class 'NoneType'>: False
print(f'np.nan: {type(not_a_num)}: {bool(not_a_num)}') # <class 'float'>: True
|
|
np.nan is a float, principally because it needs to work in math calculations (for example, summing a column that might have empty values). np.nan is True in a boolean context because it is a non-zero float -- so, following Python's rules of "truthiness", it must be True. To identify it, we'll use pandas' dedicated methods such as .isna(). |
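For example, a quick illustration of why we use pd.isna() rather than a comparison: |
import numpy as np
import pandas as pd
print(np.nan == np.nan)    # False -- nan never compares equal, even to itself
print(pd.isna(np.nan))     # True
print(pd.isna(None))       # True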
|
Ex. 9.13 | First, review the merged DataFrame and note the missing values. Then check to see if any empty values can be found in the 'course' column. |
import pandas as pd
sgrades = pd.read_csv('../student_db_grades.csv', sep=':')
snames = pd.read_csv('../student_db_names.csv')
sj = pd.merge(sgrades, snames, on='id', how='outer')
sj
print()
# see boolean column of whether missing or not
print('sj.course.isna(): ')
sj.course.isna()
print()
# see whether there are any missing values
print('sj.course.isna().any(): ')
sj.course.isna().any()
print()
# count the number of missing values
print('sj.course.isna().sum(): ')
sj.course.isna().sum()
print()
# select rows with missing data
print('sj[ sj.course.isna() ]')
sj[ sj.course.isna() ]
print()
|
|
Ex. 9.14 | Summarize the number of missing values in each column. Use .isna().sum() to count missing values per column and df.info() to see each column's non-null count. |
sj.isna().sum()
sj.info()
|
|
Ex. 9.15 | Drop rows or columns that contain missing values. Use .dropna() with axis and thresh parameters to control behavior. |
print('dataframe "sj":')
sj
print()
print('sj.dropna():')
sj.dropna() # drop rows with any NaN
print()
print('sj.dropna(axis=1)')
sj.dropna(axis=1) # drop columns with any NaN
print()
print('sj.dropna(thresh=5):')
sj.dropna(thresh=5) # keep rows with at least 5 non-NaN values
|
|
Ex. 9.16 | Fill missing values using a constant, a computed value, or a method. Use .fillna() with scalar values, aggregates, or directional methods like 'ffill'. |
import pandas as pd
import numpy as np
# sample data with missing values (same DataFrame as the next exercise)
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': ['x', 'y', 'z', None]
})
print('df:')
df
print()
print('df.fillna(0):')
df.fillna(0)
print()
print("df.ffill()")
df.ffill()
print()
print("df['A'].fillna( df['A'].mean() )")
df['A'].fillna( df['A'].mean() )
print()
|
|
Ex. 9.17 | Replace missing values using a general-purpose replacement method. Use .replace() to substitute np.nan or None with other values. |
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': ['x', 'y', 'z', None]
})
print('df:')
df
print('df.replace(np.nan, 999):')
df.replace(np.nan, 999)
|
|
Ex. 9.18 | Interpolate missing values in a numeric column. Use .interpolate() to linearly estimate missing values. |
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': ['x', 'y', 'z', None]
})
df['A'].interpolate()
|
|
Ex. 9.19 | Filter rows based on custom logic for missing data. Use Boolean masks with .isna() and .notna() to target specific missing value conditions. |
print('df:')
df
print()
print("df[ df['A'].notna() & df['B'].isna() ]")
df[ df['A'].notna() & df['B'].isna() ]
print()
|
|
Ex. 9.20 | Fill missing values within groups using the group mean. Use .groupby() combined with .transform() to fill missing data contextually. |
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': ['x', 'y', 'z', None]
})
df_grouped = df.copy()
df_grouped['Group'] = ['G1', 'G1', 'G2', 'G2']
df_grouped['A'] = df_grouped.groupby('Group')['A'].transform(lambda x: x.fillna(x.mean()))
df_grouped
|
|
MULTIINDEX |
|
Please begin by noting the multi-level index that is produced by the below multi-column aggregation. |
|
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
aggfunc='sum',
values=['SaleAmount', 'Quantity'])
table
|
|
Ex. 9.21 | MultiIndex: slicing at top level. Slice the multi-index for Manager 'Debra H.'. |
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
aggfunc='sum',
values=['SaleAmount', 'Quantity'])
table.loc['Debra H.']
|
|
Ex. 9.22 | MultiIndex: slicing at 2nd level. Use a 2-item tuple (first-level label, second-level label) to slice for just 'Debra H.' at the first level and 'Craig B.' at the 2nd level. |
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
aggfunc='sum',
values=['SaleAmount', 'Quantity'])
tl = table.loc[('Debra H.', 'Craig B.')]
tl
|
|
Ex. 9.23 | MultiIndex: "cross section" slicing. Use the .xs() method with level=1 to show rows just for 'Craig B.'. |
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
aggfunc='sum',
values=['SaleAmount', 'Quantity'])
txs = table.xs('Craig B.', level=1)
txs
|
|
Ex. 9.24 | MultiIndex: use .query('Manager == "Debra H."') to select rows. |
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
aggfunc='sum',
values=['SaleAmount', 'Quantity'])
table.query("Manager == 'Debra H.'")
# table.query("Manager == 'Debra H.' and SalesRep == 'Craig B.'")
# table.query("SalesRep in ['Craig B.', 'Daniel H.']")
|
|
Ex. 9.25 | MultiIndex: First review the below DataFrame; then use .query('Status in ["pending", "won"]') to select rows. |
import pandas as pd
df = pd.read_excel("../sales-funnel.xlsx")
dfq = df.query('Status in ["pending", "won"]')
dfq
|
|
TIMESERIES |
|
Ex. 9.26 | Create a pd.Timestamp object by passing any string that represents a date. Print the object's type and value. Next, try adding a time value to the date and note the result. |
import pandas as pd
ts = pd.Timestamp('January 3rd, 2020')
print(type(ts))
print(ts)
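For example, a string that includes a time value might be parsed like this (a sketch): |
ts2 = pd.Timestamp('2020-01-03 23:30:00')
print(ts2)    # 2020-01-03 23:30:00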
|
|
Ex. 9.27 | Create a pd.Timedelta interval object for one day (use the days=1 argument). Add it to the Timestamp object to produce a new Timestamp object with the advanced date. Print the object to see the new date. Next, try adding minutes= parameter to see the date and time advance. |
import pandas as pd
ts = pd.Timestamp('3/3/2020')
td = pd.Timedelta(days=1)
ts2 = ts + td
print(ts2)
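Adding a minutes= parameter as well might look like this (a sketch): |
td2 = pd.Timedelta(days=1, minutes=30)
print(ts + td2)    # 2020-03-04 00:30:00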
|
|
Ex. 9.28 | Create a pd.Timedelta interval object by subtracting one date from another. Print the object's value and its type. |
import pandas as pd
ts1 = pd.Timestamp('3/3/2020')
ts2 = pd.Timestamp('3/5/2020')
td = ts1 - ts2
print(type(td))
print(td)
|
|
Ex. 9.29 | Generate a date range. Use the pd.date_range() function with a start date, end date and freq= parameter indicating the frequency of dates ('D' for day, 'M' for end of month, 'MS' for start of month, 'B' for business day, etc. -- please see "NB_date_range_and_period_frequency_key.ipynb" for more). Print the object to see its contents and type. |
import pandas as pd
dr = pd.date_range('2020-02-03', '2021-02-03', freq='MS')
print(dr)
|
|
Ex. 9.30 | Converting a DataFrame column/index to datetime. View the below DataFrame and note the date column EST. Use pd.to_datetime() or Series.astype() to convert the 'EST' series to Timestamps, then use .set_index() with the column name and drop=True to set the column as the index. Check the dtype of the index. |
import pandas as pd
df = pd.read_csv('../weather_newyork.csv')
df.EST = df.EST.astype('datetime64[ns]')
df = df.set_index('EST', drop=True)
print('df: ')
df
print('df.index.dtype')
print(df.index.dtype)
|
|
Note that '<M8[ns]' is simply NumPy's shorthand for 'datetime64[ns]'; both indicate a 'timezone-naive' datetime (one that does not carry timezone information). A 'timezone-aware' column instead shows a dtype like 'datetime64[ns, US/Eastern]'. Since timezone-aware timestamps know their timezone, they can be compared with other timezone-aware dates to do calculations and comparisons between timestamps in different timezones. |
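For example, a sketch showing the difference (the 'US/Eastern' timezone is just an illustration): |
import pandas as pd
s = pd.Series(pd.to_datetime(['2020-01-03', '2020-01-04']))
print(s.dtype)                            # datetime64[ns] -- timezone-naive
s_aware = s.dt.tz_localize('US/Eastern')
print(s_aware.dtype)                      # datetime64[ns, US/Eastern] -- timezone-aware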
|
Ex. 9.31 | Select rows based on partial date. Given the below DataFrame, select rows using .loc[] for the following: |
|
|
import pandas as pd
df = pd.read_csv('../weather_newyork_2016_2018.csv')
df.EST = pd.to_datetime(df.EST)
df = df.set_index('EST', drop=True)
df.loc['2017-03-01']
print()
df.loc['2016-05']
print()
df.loc['2016']
print()
df.loc['2017-12-28': '2018-01-03']
# some dates for slicing
# '2017-03-01'
# '2016-05'
# '2016'
# '2017-12-28' : '2018-01-03'
|
|
Ex. 9.32 | Generate daily stock prices. The following code generates synthetic daily stock prices for 120 days (add an integer argument to .head() to see more rows). |
import pandas as pd
import numpy as np
np.random.seed(0)
dates = pd.date_range(start='2023-01-01', periods=120, freq='D')
prices = 100 + np.cumsum(np.random.randn(120)) # synthetic price series
df = pd.DataFrame({'date': dates, 'close': prices})
df.head()
|
|
Ex. 9.33 | Move the 'date' column to become a DateTimeIndex. Convert the 'date' column to datetime type and then set it as the DataFrame index. |
df['date'] = pd.to_datetime(df['date']) # ensure datetime type
df.set_index('date', inplace=True)
df.head()
|
|
Ex. 9.34 | Resample to show monthly returns. Use .resample('M') with .mean() to get the average price for each month, then add .pct_change() to calculate percent change. |
monthly_returns = df['close'].resample('M').mean().pct_change()
monthly_returns
|
|
Ex. 9.35 | Group the data by month or quarter. Add a column representing the period, then group by that column and compute the mean. |
df['month'] = df.index.to_period('M')
monthly_grouped = df.groupby('month')['close'].mean()
monthly_grouped
# alternatively, group by quarter:
# df['quarter'] = df.index.to_period('Q')
# quarterly_grouped = df.groupby('quarter')['close'].mean()
# quarterly_grouped
|
|
Ex. 9.36 | Compute a 30-day moving average. Use .rolling(window=30).mean() to calculate the 30-day moving average of closing prices. |
df['30d_ma'] = df['close'].rolling(window=30).mean()
df[['close', '30d_ma']].tail(10)
|
|
Ex. 9.37 | Compare, align or join with another index using date-based alignment. Create a second time series and join it to the existing DataFrame on the date index. |
benchmark = pd.DataFrame({
'date': pd.date_range(start='2023-01-01', periods=120, freq='D'),
'benchmark_close': 102 + np.cumsum(np.random.randn(120))
})
benchmark.set_index('date', inplace=True)
joined = df.join(benchmark, how='inner')
joined.head()
|
|
Ex. 9.38 | Use .shift(). Shift the close price by one day to compare with the current price and calculate daily return. |
df['prev_close'] = df['close'].shift(1)
df['daily_return'] = (df['close'] - df['prev_close']) / df['prev_close']
df[['close', 'prev_close', 'daily_return']].head(10)
|
|
REVIEW OF LAMBDA FUNCTIONS |
|
Ex. 9.39 | Convert a function to lambda. The following function takes one argument and returns the value doubled. Convert to lambda and use in the map() function. |
# def doubleit(arg):
# return arg * 2
seq = [1, 2, 3, 4]
seq2 = map(lambda arg: arg * 2, seq)
print(list(seq2)) # [2, 4, 6, 8]
|
|
Ex. 9.40 | Convert another function to lambda. The following function takes a dict key and returns the value. Convert to a lambda for use in the sorted() function. |
d = {'a': 10, 'b': 2, 'c': 5}
# def byvalue(arg):
# return d[arg]
skeys = sorted(d, key=lambda dict_key: d[dict_key])
print(skeys) # ['b', 'c', 'a'] (in order of value, low-to-high)
|
|
DATAFRAME .apply() |
|
Ex. 9.41 | Apply a function to each item in a Series column. Create new column 'd' that is the value of 'b' * 100, by using .apply() with the centify function (defined below). 'd' column values should be 290.0, 350.0, 490.0. When correct, replace the function with a lambda. |
import pandas as pd
# def centify(arg):
# return arg * 100
dol = { 'a': [1, 2, 3],
'b': [2.9, 3.5, 4.9],
'c': [0.1, 0.2, 0.3] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
df['d'] = df.b.apply(lambda arg: arg * 100)
df
|
|
Ex. 9.42 | Apply a function to each item in a Series row. Create new row 'xc' that is the value of row 'x' * 100, by using .apply() with the centify function (defined below). 'xc' row values should be 200, 35000.0, 20.0. When correct, replace the function with a lambda. |
import pandas as pd
# def centify(arg):
# return arg * 100
dol = { 'a': [1, 2, 3],
'b': [290.0, 350.0, 490.0],
'c': [0.1, 0.2, 0.3] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
df.loc['xc'] = df.loc['x'].apply(lambda arg: arg * 100)
df
|
|
Ex. 9.43 | Apply a function to each Series column in a DataFrame. Call .apply() on the DataFrame, which will pass each column Series to the function in turn. Inside the function, sum up the column values and create a new 'total' row with the sum of values. |
import pandas as pd
def sum_col(series_column):
series_column.loc['total'] = series_column.sum()
return series_column
dol = { 'a': [1, 2, 3],
'b': [290.0, 350.0, 490.0],
'c': [0.1, 0.2, 0.3] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
df = df.apply(sum_col)
df
|
|
Ex. 9.44 | Apply a function to each Series row in a DataFrame. Call .apply() on the DataFrame with axis=1, which will pass each row Series to the function in turn. Sum up the row values and create a new 'd' column with the sum of each row's values. |
import pandas as pd
def sum_row(series_row):
series_row['d'] = series_row.sum()
return series_row
dol = { 'a': [1, 2, 3],
'b': [290.0, 350.0, 490.0],
'c': [0.1, 0.2, 0.3] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
df = df.apply(sum_row, axis=1)
df
|
|