Advanced Python
In-Class Exercises, Session 9
Notes for Jupyter Notebook users: to call exit() from a notebook, please use sys.exit() (requires import sys). If a strange error occurs, it may be because Jupyter retains variables from all executed cells. To reset the notebook, click 'Restart Kernel' (the circular-arrow button) -- this will not undo any changes you have made to the notebook's cells. Also note that the data files are located in the parent directory, so you must prepend ../ to each filename to find it. |
|
Ex. 9.1 | Include some useful settings. |
import pandas as pd
import numpy as np
import warnings
# enable multiple outputs from jupyter cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# set default number of DataFrame rows printed to 8
pd.set_option('display.max_rows', 8)
# suppress excel reader warning: 'extension is unsupported'
warnings.filterwarnings('ignore', category=UserWarning,
module='openpyxl')
# set the "Copy-on-Write" functionality to avoid
# "copy of a slice" warning and behavior
pd.options.mode.copy_on_write = True
|
|
SERIES AND DATAFRAME DTYPES |
|
Ex. 9.2 | Review: check the .dtypes of a DataFrame and .dtype of a Series. |
import pandas as pd
dol = { 'a': [1, 2, 3],
'b': [2.9, 3.5, 4.9],
'c': ['yourstr', 'mystr', 'theirstr'] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
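One possible solution sketch -- inspect the DataFrame's dtypes and a single Series' dtype: |
df.dtypes          # a: int64, b: float64, c: object
df['a'].dtype      # dtype('int64')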
|
|
Ex. 9.3 | Use Series .astype(). Change the 'a' column to np.float64, and 'b' column to str; check dtypes before and after the operation. |
import pandas as pd
import numpy as np
dol = { 'a': [1, 2, 3],
'b': [2.9, 3.5, 4.9],
'c': ['yourstr', 'mystr', 'theirstr'] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
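A possible solution sketch using Series .astype(): |
print(df.dtypes)                        # a: int64, b: float64, c: object
df['a'] = df['a'].astype(np.float64)
df['b'] = df['b'].astype(str)
print(df.dtypes)                        # a: float64, b: object, c: object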
|
|
Ex. 9.4 | Use pd.to_numeric(). Change the 'a' and 'b' columns to a numeric type, and show dtypes afterwards. |
import pandas as pd
dol = { 'a': ['1', '2', '3'],
'b': ['2.9', '3.5', '4.9'],
'c': ['yourstr', 'mystr', 'theirstr'] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
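A possible solution sketch using pd.to_numeric(): |
df['a'] = pd.to_numeric(df['a'])        # integer strings become int64
df['b'] = pd.to_numeric(df['b'])        # decimal strings become float64
df.dtypes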
|
|
AGGREGATIONS WITH df.groupby() AND pd.pivot_table() |
|
Please run this cell to view the sales-funnel spreadsheet: |
|
import pandas as pd
df = pd.read_excel("../sales-funnel.xlsx")
df
|
|
Ex. 9.5 | Single-column aggregation (.groupby() and .pivot_table()). Show sum of 'SaleAmount' grouped by 'SalesRep'. |
import pandas as pd
df = pd.read_excel("../sales-funnel.xlsx")
dfgb = df.groupby('SalesRep').SaleAmount.sum()
dfgb
dfpt = pd.pivot_table(df, index='SalesRep',
values='SaleAmount',
aggfunc='sum')
dfpt
|
|
Ex. 9.6 | Multi-column aggregation (.groupby() and .pivot_table()). Show sums of 'SaleAmount' and 'Quantity' grouped by 'SalesRep' and by 'Product'. |
import pandas as pd
df = pd.read_excel("../sales-funnel.xlsx")
dfgb = df.groupby(['SalesRep', 'Product'])[['SaleAmount', 'Quantity']].sum()
dfgb
dfpt = pd.pivot_table(df, index=['SalesRep', 'Product'],
values=['SaleAmount', 'Quantity'], aggfunc='sum')
dfpt
|
|
Ex. 9.7 | Further agg by secondary value (.groupby() and .pivot_table()). Perform the above pivot on just 'SalesRep', but use the columns= parameter to show the results broken down by 'Product'. |
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
dfpt = pd.pivot_table(df, index='SalesRep',
aggfunc='sum',
values=['SaleAmount', 'Quantity'],
columns='Product')
dfpt
dfgb = df.groupby(['SalesRep', 'Product'])[['SaleAmount', 'Quantity']].sum()
dfgb
|
|
Ex. 9.8 | Use .stack() and .unstack() (with .groupby()). Looking at the above example, note that .groupby() produces results in "long" format, while .pivot_table() renders them in "wide" format. Use .unstack() and .stack() to move between the two. |
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
dfgb = df.groupby(['SalesRep', 'Product'])[['SaleAmount', 'Quantity']].sum()
dfgb
dfgbus = dfgb.unstack()
dfgbus
dfgbs = dfgbus.stack()    # back to "long" format
dfgbs
|
|
Ex. 9.9 | Custom aggregation function (.groupby() and .pivot_table()). Group rows by 'SalesRep' and apply a function to each group that sums up the 'SaleAmount' column. (In other words, this replicates what .groupby().sum() does). |
import pandas as pd
def price_sum(this_df):
return this_df.SaleAmount.sum()
df = pd.read_excel("../sales-funnel.xlsx")
dfgb = df.groupby('SalesRep').apply(price_sum)
dfgb
def price_sum_series(series):
return series.sum()
dfpt = pd.pivot_table(df, index='SalesRep',
values='SaleAmount',
aggfunc=price_sum_series)
dfpt
|
|
Ex. 9.10 | Custom 'summary' function that refers to other columns (.groupby() only). Building on the previous solution, customize the function to sum SaleAmount but reduce hardware values ('CPU' and 'Monitor') by 20% in the sum (because of the overhead cost of handling hardware). |
import pandas as pd
def price_sum(this_df):
this_df.loc[ this_df.Product.isin(['CPU', 'Monitor']),
'SaleAmount' ] = this_df.SaleAmount * .8
return this_df.SaleAmount.sum()
df = pd.read_excel("../sales-funnel.xlsx")
dfgb = df.groupby('SalesRep').apply(price_sum)
dfgb
|
|
Since .pivot_table() only passes the Series of the column being aggregated, a custom aggregation cannot take into account the values in other columns. (In multi-column aggregations using .pivot_table(), the aggfunc= function is applied separately to each column). |
|
Ex. 9.11 | Custom 'grouping' function (.groupby() only). Pass a function to groupby() that identifies 2 groups using the 'Product' column: if the value is 'CPU' or 'Monitor', return 'hardware'; otherwise, return 'software'. Count the number of rows in each group with .count(). |
import pandas as pd
def grouper(row_idx):
row = df.loc[row_idx]
if row['Product'] in ['Software', 'Maintenance']:
return 'software'
else:
return 'hardware'
df = pd.read_excel("../sales-funnel.xlsx")
df.groupby(grouper).Account.count()
|
|
The function receives a row index, which you can use with .loc[] to find the row being grouped. Have the function return 'software' if the 'Product' value is 'Software' or 'Maintenance', and 'hardware' if it is 'CPU' or 'Monitor'. .pivot_table() does not offer a way to customize groupings; however, you can precompute a grouping key for each row in the DataFrame and then pivot on that, as shown in the sketch below. |
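For example, a sketch of the "precompute a grouping key" approach (the 'ProductGroup' column name is just an illustration): |
import pandas as pd
df = pd.read_excel("../sales-funnel.xlsx")
# derive a hardware/software key from 'Product', then pivot on it
df['ProductGroup'] = df['Product'].map(
    lambda p: 'software' if p in ['Software', 'Maintenance'] else 'hardware')
pd.pivot_table(df, index='ProductGroup', values='Account', aggfunc='count')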
|
WORKING WITH MISSING DATA |
|
Ex. 9.12 | Compare None to np.nan. |
import numpy as np
import pandas as pd
pynone = None # means "empty"
not_a_num = np.nan # means "indeterminate number"
print(f'None: {type(pynone)}: {bool(pynone)}') # <class 'NoneType'>: False
print(f'np.nan: {type(not_a_num)}: {bool(not_a_num)}') # <class 'float'>: True
|
|
np.nan is a float, principally because it needs to work in math calculations (for example, summing a column that might have empty values). np.nan is True in a boolean context because it is a non-zero float -- so, following Python's rules of "truthiness", it must be True. To identify it, we'll use pandas' dedicated methods such as .isna(). |
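For example, a quick illustration of why we use pd.isna() rather than a comparison: |
import numpy as np
import pandas as pd
print(np.nan == np.nan)    # False -- nan never compares equal, even to itself
print(pd.isna(np.nan))     # True
print(pd.isna(None))       # True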
|
Ex. 9.13 | First, review the merged DataFrame and note the missing values. Then check to see if any empty values can be found in the 'course' column. |
import pandas as pd
sgrades = pd.read_csv('../student_db_grades.csv', sep=':')
snames = pd.read_csv('../student_db_names.csv')
sj = pd.merge(sgrades, snames, on='id', how='outer')
sj
print()
# see boolean column of whether missing or not
print('sj.course.isna(): ')
sj.course.isna()
print()
# see whether there are any missing values
print('sj.course.isna().any(): ')
sj.course.isna().any()
print()
# count the number of missing values
print('sj.course.isna().sum(): ')
sj.course.isna().sum()
print()
# select rows with missing data
print('sj[ sj.course.isna() ]')
sj[ sj.course.isna() ]
print()
|
|
Ex. 9.14 | Summarize the number of missing values in each column. Use .isna().sum() to count missing values per column and df.info() to see each column's non-null count. |
sj.isna().sum()
sj.info()
|
|
Ex. 9.15 | Drop rows or columns that contain missing values. Use .dropna() with axis and thresh parameters to control behavior. |
print('dataframe "sj":')
sj
print()
print('sj.dropna():')
sj.dropna() # drop rows with any NaN
print()
print('sj.dropna(axis=1)')
sj.dropna(axis=1) # drop columns with any NaN
print()
print('sj.dropna(thresh=5):')
sj.dropna(thresh=5) # keep rows with at least 5 non-NaN values
|
|
Ex. 9.16 | Fill missing values using a constant, a computed value, or a method. Use .fillna() with scalar values, aggregates, or directional methods like 'ffill'. |
import pandas as pd
import numpy as np
# sample data with missing values (same DataFrame as the next exercise)
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': ['x', 'y', 'z', None]
})
print('df:')
df
print()
print('df.fillna(0):')
df.fillna(0)
print()
print("df.ffill()")
df.ffill()
print()
print("df['A'].fillna( df['A'].mean() )")
df['A'].fillna( df['A'].mean() )
print()
|
|
Ex. 9.17 | Replace missing values using a general-purpose replacement method. Use .replace() to substitute np.nan or None with other values. |
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': ['x', 'y', 'z', None]
})
print('df:')
df
print('df.replace(np.nan, 999):')
df.replace(np.nan, 999)
|
|
Ex. 9.18 | Interpolate missing values in a numeric column. Use .interpolate() to linearly estimate missing values. |
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': ['x', 'y', 'z', None]
})
df['A'].interpolate()
|
|
Ex. 9.19 | Filter rows based on custom logic for missing data. Use Boolean masks with .isna() and .notna() to target specific missing value conditions. |
print('df:')
df
print()
print("df[ df['A'].notna() & df['B'].isna() ]")
df[ df['A'].notna() & df['B'].isna() ]
print()
|
|
Ex. 9.20 | Fill missing values within groups using the group mean. Use .groupby() combined with .transform() to fill missing data contextually. |
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': ['x', 'y', 'z', None]
})
df_grouped = df.copy()
df_grouped['Group'] = ['G1', 'G1', 'G2', 'G2']
df_grouped['A'] = df_grouped.groupby('Group')['A'].transform(lambda x: x.fillna(x.mean()))
df_grouped
|
|
MULTIINDEX |
|
Please begin by noting the multi-level index that is produced by the below multi-column aggregation. |
|
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
aggfunc='sum',
values=['SaleAmount', 'Quantity'])
table
|
|
Ex. 9.21 | MultiIndex: slicing at top level. Slice the multi-index for Manager 'Debra H.'. |
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
aggfunc='sum',
values=['SaleAmount', 'Quantity'])
table.loc['Debra H.']
|
|
Ex. 9.22 | MultiIndex: slicing at 2nd level. Use a 2-item tuple (first-level label, second-level label) to slice for just 'Debra H.' at the first level and 'Craig B.' at the 2nd level. |
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
aggfunc='sum',
values=['SaleAmount', 'Quantity'])
tl = table.loc[('Debra H.', 'Craig B.')]
tl
|
|
Ex. 9.23 | MultiIndex: "cross section" slicing. Use the .xs() method with level=1 to show rows just for 'Craig B.'. |
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
aggfunc='sum',
values=['SaleAmount', 'Quantity'])
txs = table.xs('Craig B.', level=1)
txs
|
|
Ex. 9.24 | MultiIndex: use .query('Manager == "Debra H."') to select rows. |
import pandas as pd
import numpy as np
df = pd.read_excel("../sales-funnel.xlsx")
table = pd.pivot_table(df, index=['Manager', 'SalesRep'],
aggfunc='sum',
values=['SaleAmount', 'Quantity'])
table.query("Manager == 'Debra H.'")
# table.query("Manager == 'Debra H.' and SalesRep == 'Craig B.'")
# table.query("SalesRep in ['Craig B.', 'Daniel H.']")
|
|
Ex. 9.25 | MultiIndex: First review the below DataFrame; then use .query('Status in ["pending", "won"]') to select rows. |
import pandas as pd
df = pd.read_excel("../sales-funnel.xlsx")
dfq = df.query('Status in ["pending", "won"]')
dfq
|
|
TIMESERIES |
|
Ex. 9.26 | Create a pd.Timestamp object by passing any string that represents a date. Print the object's type and value. Next, try adding a time value to the date and note the result. |
import pandas as pd
ts = pd.Timestamp('January 3rd, 2020')
print(type(ts))
print(ts)
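For example, a string that includes a time value might be parsed like this (a sketch): |
ts2 = pd.Timestamp('2020-01-03 23:30:00')
print(ts2)    # 2020-01-03 23:30:00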
|
|
Ex. 9.27 | Create a pd.Timedelta interval object for one day (use the days=1 argument). Add it to the Timestamp object to produce a new Timestamp object with the advanced date. Print the object to see the new date. Next, try adding minutes= parameter to see the date and time advance. |
import pandas as pd
ts = pd.Timestamp('3/3/2020')
td = pd.Timedelta(days=1)
ts2 = ts + td
print(ts2)
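Adding a minutes= parameter as well might look like this (a sketch): |
td2 = pd.Timedelta(days=1, minutes=30)
print(ts + td2)    # 2020-03-04 00:30:00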
|
|
Ex. 9.28 | Create a pd.Timedelta interval object by subtracting one date from another. Print the object's value and its type. |
import pandas as pd
ts1 = pd.Timestamp('3/3/2020')
ts2 = pd.Timestamp('3/5/2020')
td = ts1 - ts2
print(type(td))
print(td)
|
|
Ex. 9.29 | Generate a date range. Use the pd.date_range() function with a start date, end date and freq= parameter indicating the frequency of dates ('D' for day, 'M' for end of month, 'MS' for start of month, 'B' for business day, etc. -- please see "NB_date_range_and_period_frequency_key.ipynb" for more). Print the object to see its contents and type. |
import pandas as pd
dr = pd.date_range('2020-02-03', '2021-02-03', freq='MS')
print(dr)
|
|
Ex. 9.30 | Converting a DataFrame column/index to datetime. View the below DataFrame and note the date column EST. Use pd.to_datetime() or Series.astype() to convert the 'EST' series to Timestamps, then use .set_index() with the column name and drop=True to set the column as the index. Check the dtype of the index. |
import pandas as pd
df = pd.read_csv('../weather_newyork.csv')
df.EST = df.EST.astype('datetime64[ns]')
df = df.set_index('EST', drop=True)
print('df: ')
df
print('df.index.dtype')
print(df.index.dtype)
|
|
Note that '<M8[ns]' is simply NumPy's shorthand for 'datetime64[ns]'; both indicate a 'timezone-naive' datetime (one that does not carry timezone information). A 'timezone-aware' column instead shows a dtype like 'datetime64[ns, US/Eastern]'. Since timezone-aware timestamps know their timezone, they can be compared with other timezone-aware dates to do calculations and comparisons between timestamps in different timezones. |
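For example, a sketch showing the difference (the 'US/Eastern' timezone is just an illustration): |
import pandas as pd
s = pd.Series(pd.to_datetime(['2020-01-03', '2020-01-04']))
print(s.dtype)                            # datetime64[ns] -- timezone-naive
s_aware = s.dt.tz_localize('US/Eastern')
print(s_aware.dtype)                      # datetime64[ns, US/Eastern] -- timezone-aware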
|
Ex. 9.31 | Select rows based on partial date. Given the below DataFrame, select rows using .loc[] for the following: |
|
|
import pandas as pd
df = pd.read_csv('../weather_newyork_2016_2018.csv')
df.EST = pd.to_datetime(df.EST)
df = df.set_index('EST', drop=True)
df.loc['2017-03-01']
print()
df.loc['2016-05']
print()
df.loc['2016']
print()
df.loc['2017-12-28': '2018-01-03']
# some dates for slicing
# '2017-03-01'
# '2016-05'
# '2016'
# '2017-12-28' : '2018-01-03'
|
|
Ex. 9.32 | Generate daily stock prices. The following code generates synthetic daily stock prices for 120 days (add an integer argument to .head() to see more rows). |
import pandas as pd
import numpy as np
np.random.seed(0)
dates = pd.date_range(start='2023-01-01', periods=120, freq='D')
prices = 100 + np.cumsum(np.random.randn(120)) # synthetic price series
df = pd.DataFrame({'date': dates, 'close': prices})
df.head()
|
|
Ex. 9.33 | Move the 'date' column to become a DateTimeIndex. Convert the 'date' column to datetime type and then set it as the DataFrame index. |
df['date'] = pd.to_datetime(df['date']) # ensure datetime type
df.set_index('date', inplace=True)
df.head()
|
|
Ex. 9.34 | Resample to show monthly returns. Use .resample('M') with .mean() to get the average price for each month, then add .pct_change() to calculate percent change. |
monthly_returns = df['close'].resample('M').mean().pct_change()
monthly_returns
|
|
Ex. 9.35 | Group the data by month or quarter. Add a column representing the period, then group by that column and compute the mean. |
df['month'] = df.index.to_period('M')
monthly_grouped = df.groupby('month')['close'].mean()
monthly_grouped
# alternatively, group by quarter:
# df['quarter'] = df.index.to_period('Q')
# quarterly_grouped = df.groupby('quarter')['close'].mean()
# quarterly_grouped
|
|
Ex. 9.36 | Compute a 30-day moving average. Use .rolling(window=30).mean() to calculate the 30-day moving average of closing prices. |
df['30d_ma'] = df['close'].rolling(window=30).mean()
df[['close', '30d_ma']].tail(10)
|
|
Ex. 9.37 | Compare, align or join with another index using date-based alignment. Create a second time series and join it to the existing DataFrame on the date index. |
benchmark = pd.DataFrame({
'date': pd.date_range(start='2023-01-01', periods=120, freq='D'),
'benchmark_close': 102 + np.cumsum(np.random.randn(120))
})
benchmark.set_index('date', inplace=True)
joined = df.join(benchmark, how='inner')
joined.head()
|
|
Ex. 9.38 | Use .shift(). Shift the close price by one day to compare with the current price and calculate daily return. |
df['prev_close'] = df['close'].shift(1)
df['daily_return'] = (df['close'] - df['prev_close']) / df['prev_close']
df[['close', 'prev_close', 'daily_return']].head(10)
|
|
REVIEW OF LAMBDA FUNCTIONS |
|
Ex. 9.39 | Convert a function to lambda. The following function takes one argument and returns the value doubled. Convert to lambda and use in the map() function. |
# def doubleit(arg):
# return arg * 2
seq = [1, 2, 3, 4]
seq2 = map(lambda arg: arg * 2, seq)
print(list(seq2)) # [2, 4, 6, 8]
|
|
Ex. 9.40 | Convert another function to lambda. The following function takes a dict key and returns the value. Convert to a lambda for use in the sorted() function. |
d = {'a': 10, 'b': 2, 'c': 5}
# def byvalue(arg):
# return d[arg]
skeys = sorted(d, key=lambda dict_key: d[dict_key])
print(skeys) # ['b', 'c', 'a'] (in order of value, low-to-high)
|
|
DATAFRAME .apply() |
|
Ex. 9.41 | Apply a function to each item in a Series column. Create new column 'd' that is the value of 'b' * 100, by using .apply() with the centify function (defined below). 'd' column values should be 290.0, 350.0, 490.0. When correct, replace the function with a lambda. |
import pandas as pd
# def centify(arg):
# return arg * 100
dol = { 'a': [1, 2, 3],
'b': [2.9, 3.5, 4.9],
'c': [0.1, 0.2, 0.3] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
df['d'] = df.b.apply(lambda arg: arg * 100)
df
|
|
Ex. 9.42 | Apply a function to each item in a Series row. Create new row 'xc' that is the value of row 'x' * 100, by using .apply() with the centify function (defined below). 'xc' row values should be 200, 35000.0, 20.0. When correct, replace the function with a lambda. |
import pandas as pd
# def centify(arg):
# return arg * 100
dol = { 'a': [1, 2, 3],
'b': [290.0, 350.0, 490.0],
'c': [0.1, 0.2, 0.3] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
df.loc['xc'] = df.loc['x'].apply(lambda arg: arg * 100)
df
|
|
Ex. 9.43 | Apply a function to each Series column in a DataFrame. Call .apply() on the DataFrame, which will pass each column Series to the function in turn. Inside the function, sum up the column values and create a new 'total' row with the sum of values. |
import pandas as pd
def sum_col(series_column):
series_column.loc['total'] = series_column.sum()
return series_column
dol = { 'a': [1, 2, 3],
'b': [290.0, 350.0, 490.0],
'c': [0.1, 0.2, 0.3] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
df = df.apply(sum_col)
df
|
|
Ex. 9.44 | Apply a function to each Series row in a DataFrame. Call .apply() on the DataFrame with axis=1, which will pass each row Series to the function in turn. Sum up the row values and create a new 'd' column with the sum of each row's values. |
import pandas as pd
def sum_row(series_row):
series_row['d'] = series_row.sum()
return series_row
dol = { 'a': [1, 2, 3],
'b': [290.0, 350.0, 490.0],
'c': [0.1, 0.2, 0.3] }
df = pd.DataFrame(dol, index=['w', 'x', 'y'])
df = df.apply(sum_row, axis=1)
df
|
|