Advanced Python
Homework Discussion, Session 8

NOTE: WHEN COMPLETE, PLEASE EMAIL YOUR .ipynb FILE, To submit your solutions, save your notebook, then please send an email to me with the session_8_homework.ipynb file attached. (If you'd like to make my life a tiny bit easier, please rename the notebook with your name, for example session_8_homework_jim_jansen,ipynb - thanks!)

8.1 The commonly used import statement is import pandas as pd. We use this abbreviation to save typing.
 
8.2 Use the read_excel() function of pandas to read the workbook. Since a workbook has several sheets (this one has two) you should also specify the sheet name with the 'sheetname=' parameter argument to read_excel().

If you want to explore the options for any function or method, you can use the help() Python function:

help(pd.read_excel)
help(df.rename)   # assuming a Dataframe in df

Also note that to make your output look like mine you must print the object; but if you want to see the data displayed in the comely "gridlines" output, you can simply reference the variable in the last line of the cell.

 
8.3 Use the .rename() method with a dict mapping. Although you can easily set these values using the .columns attribute of the DataFrame (df1.columns=), you should use this only sparingly. .rename() species the mapping, while .columns just blindly resets the labels.
 
8.4 If we intend to eventually change the data on the resulting slice (rather than just view it), we should use .loc[] so we know this slice is not a temporary copy. .loc[] would have two arguments here -- a slice indicating all rows, and a list with the names of the columns. See slides.
 
8.5 Simple broadcast behavior when creating a new column. See slides.
 
8.6 Simple aggregate function applied to the DataFrame -- see slides. Note that my results include 'CTRY_CODE' and that Python attempted to "sum" the strings in that column. Obviously this should be excluded -- preferably in the same statement, by using a slice and then attaching the aggregate function.
 
8.7 Simple aggregate function applied to a column Series selected from the DataFrame. My print statement is simply printing the string and the function output as two items in the same print() statement.
 
8.8 We're going back to the original df1 because we want to include columns that were excluded in our df2. As we are not intending to make further changes to the output, we have the option of using a simple subscript; or, we can use .loc[].
 
8.9 This requires a .loc[] or subscript condition with a compound test (i.e., two logical tests). Keep in mind that each test must be surrounded by its own parentheses, and we use the & ("and") and | ("or") to combine the tests.
 
8.10 Note again that my output is showing the average for both of the numeric columns, which is the default when neither is isolated. To show just the average of the one column, you would subscript the dataframe for that column and apply the aggregate function to the resulting Series.

The scientific notation shown in the result is apparently a challenge to display otherwise (it would need to be replaced with a string formatter). Of course if exported to Excel it would be able to be represented in any form supported by Excel.

 
8.11 This can be done in two statements, one for each change.

Again, using .loc means that we must pass two arguments: the row selector (which in this case is the condition selecting 'Eurozone' or 'Non'), and the column statement (which can be just the string column name). However this use of .loc[] will assign the corrected value ('Y' and 'N') to the .loc[] statement -- basically saying, "for these selected rows, make this column equal to this new value"

 
8.12 Simple groupby() aggregation. See slides.
 
8.13 This should be a Series where one column of the excel sheet is the index, and the other column are the values in the Series. We want the CTRY_CODE to be the index because when we do a join the index will be the default 'right_on=' values. The FULL_NAME values should then become a new column in the dataframe.
 
8.14 This is a left join (the default) that favors the rows in the left table (the dataframe we have been working on). The default join column is the index, so you can specify what to join on if you prefer not to join on the index.

If joining the df with a Series, the Series must have a name (which will become the name of the new column). If using df.join(), the "other" DataFrame or Series will be joined on its index (you can specify the join column for "this" DataFrame with the on= parameter argument). Or, you can use pd.merge(), which requires 2 DataFrames. These will also default to joining on the indices, but you can specify other columns to join on with the left_on= and right_on= args. Send me questions!

 
[pr]