Introduction to Python

davidbpython.com




Project Discussion, Session 5



5.1 Notes typing assignment. Please write out this week's transcription notes.
 
5.2 Filepaths Exercises.

First, you must not use an absolute path (one that starts with C:\ or another drive letter, or one that starts with a slash, /). Absolute paths can correctly locate a file, but we are here to learn how to construct relative paths. A relative path is one that locates a file relative to the location of the script. Therefore, we must first consider the location of the script within the directory tree. There are four relative locations to consider:

  1. file in same directory as script: simply use the filename, with no path information
  2. file in directory "above" script (for example, dir2a is one level above dir1): precede the filename with the directory name where the file is located and a slash, i.e. dir2a/. If the directory is multiple levels above, "chain" them into a path (for example, to get to a file in dir3a from dir1, the path is dir2a/dir3a/)
  3. file in directory "below" script (for example, dir1 is one level below dir2a): precede the filename with the "parent" directory shortcut and a slash, i.e. ../. If the directory is multiple levels down, "chain" them into a path (for example, to get to a file in dir1 from dir3a, the path is ../../)
  4. file in a "sibling" directory to the script's directory (for example, dir2a and dir2b are at the same directory level): use the .. to go "down" to the parent directory where the target directory is located, then add the target directory name (for example, to get to a file in dir2a from dir2b, the path is ../dir2a/)
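
The four cases above can be checked without touching the disk. The sketch below is only an illustration: the directory layout is inferred from the examples in the list, the resolve() helper and the name file.txt are hypothetical, and it assumes the script is run from its own directory (Python resolves relative paths against the current working directory).

```python
import posixpath

# Assumed layout, inferred from the examples in the list above:
#   dir1/
#       dir2a/
#           dir3a/
#       dir2b/

def resolve(script_dir, relative_path):
    """Return the location that relative_path reaches when starting in script_dir."""
    return posixpath.normpath(posixpath.join(script_dir, relative_path))

# 1. same directory: the filename alone
same_dir = resolve("dir1", "file.txt")
# 2. directory "above" the script: chain the directory names
above = resolve("dir1", "dir2a/dir3a/file.txt")
# 3. directory "below" the script: chain the .. shortcut
below = resolve("dir1/dir2a/dir3a", "../../file.txt")
# 4. sibling directory: .. to the shared parent, then the target directory
sibling = resolve("dir1/dir2b", "../dir2a/file.txt")

for path in (same_dir, above, below, sibling):
    print(path)
```

Each .. cancels one directory name to its left, which is why ../../ from dir3a lands in dir1.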

Troubleshooting:

  • make sure all names are correctly spelled - they are very similar, so take extra care when reviewing
  • make sure you are including a slash between each folder and file, so ..dir1 should be ../dir1
  • make sure the path is relative, so it must not start with a slash or drive letter
  • check and recheck the relative location between the script you are running and the target file
  • if having difficulty reaching a file that is multiple levels away, you can try to "walk" your way to the file by opening a closer file on the path

 
5.3 Lookup Dictionary. Reading file states.csv (see the file in this session's source data), build a dict of pairs with each state's name as key and the abbreviation as the value (for example, New York as key and NY as value). Then, read user input for a state name. If the state name is a key in the dict, display the abbreviation for that state.

The program takes a state name (through input()) and looks it up in the dict to retrieve and print the state's 2-letter abbreviation. If the input name is not a state listed in the dict, the program prints no state found with name "[state name]" (where [state name] is the input name).

Sample program runs:
there are 50 pairs in the lookup dict
please enter a state name:  California
CA
there are 50 pairs in the lookup dict
please enter a state name:  New York
NY
there are 50 pairs in the lookup dict
please enter a state name:  Oman
no state found with name "Oman"


Overview. This is one common use of a dictionary: a "lookup", in this case from state names to state abbreviations. We will:

  • load the dictionary with the data from the file, with the state name as the dictionary key, and the abbreviation as the dictionary value
  • take the length of the dict and announce this number
  • allow the user to submit a state name to look up in the dict
  • check to see if the state name is in the dictionary; if it is not, print a message and exit
  • look up the state name in the dict to see the state abbreviation
  • print the state abbreviation


Read user input: no error checking is required on the input.

Build the dict: read through the file line-by-line with for, and for each line in the file, split the line into state abbreviation, population, area and state name, and then add the state name and abbreviation to the dict as a key/value pair using subscript assignment syntax (see "add a key/value pair to a dictionary" in the slides for correct syntax).

Note that if Python can't find your file, it may be because the relative path is incorrect. Please see "Filepaths for Locating Files" in Session 4 Slides.

Test this step: your dict should have a len() of 50 and when printed should show the state name as the key and the state abbreviation as the value in each pair.

Check to see if the user's input state name is a key in the dict. Use the 'in' operator to test whether the user's input is one of the dict keys (see "check for key membership" in the slides or "in membership test" in pythonreference.com). If the key is not in the dict, exit() with the message no state found with name "[input]" (where [input] is the user's input state name).

Print the value associated with the user's input. See "read a value based on a key" in the slides.
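
As a rough sketch of the steps above (not a model solution), the code below builds a lookup dict from a short list of strings standing in for the file, then tests membership before reading the value. The sample lines, their placeholder numbers, the field order, and the names build_lookup and lookup are all assumptions for illustration; check the real states.csv for the actual layout.

```python
# Hypothetical stand-in for the lines of states.csv. The field order
# (abbreviation, population, area, name) and the numbers are placeholders.
sample_lines = [
    "CA,100,200,California\n",
    "NY,300,400,New York\n",
]

def build_lookup(lines):
    """Build a dict with state name as key and abbreviation as value."""
    state_abbrevs = {}
    for line in lines:
        # multi-target assignment: one variable per field
        abbrev, population, area, name = line.rstrip("\n").split(",")
        state_abbrevs[name] = abbrev          # subscript assignment adds the pair
    return state_abbrevs

def lookup(state_abbrevs, state_name):
    """Return the abbreviation, or the 'not found' message if the key is absent."""
    if state_name not in state_abbrevs:       # key membership test
        return f'no state found with name "{state_name}"'
    return state_abbrevs[state_name]          # read a value based on a key

state_abbrevs = build_lookup(sample_lines)
print(f"there are {len(state_abbrevs)} pairs in the lookup dict")
print(lookup(state_abbrevs, "New York"))
print(lookup(state_abbrevs, "Oman"))
```

In the actual assignment the lines come from the open file and the name comes from input(); functions are used here only so that the pieces can be tried separately.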

 
5.4 Lookup Dictionary with try/except (notations not required for this solution). Replace the 'if state not in dict' language (i.e., checking to see if the user's state name is a key in the dict) with a try/except. So your solution will not check for the key ahead of time, and will instead trap the exception when it occurs.

First, remove the 'if state not in dict' block. We will not test to see if the key is in the dict. Next, run the code and give the program a bad state name, i.e. one that does not exist in the dictionary. Next, identify two things:

  1. The exception type that is raised if the key can't be found. You can find this by looking at the last line of the Traceback message - look for the CamelCase word
  2. The code line from which the exception is raised. This is in the middle of the Traceback message, above the line that names the exception type.

Next, wrap the try: block around only the line where the exception is expected, and no other lines. You must not include lines that aren't related to the exception line -- this is an important best practice. (Sometimes you may need to wrap lines that are part of a block, for example 'for' blocks.) Next, follow the try: block with an except:, and specify the exception that was raised in your earlier test. Inside the except: block, print the same error message you had used earlier to signal that the state name key was not found in the dictionary.
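
A minimal sketch of this pattern follows, using a two-item stand-in dict instead of the full file. It assumes the exception seen in the earlier test was KeyError, which is what Python raises when a dict is subscripted with a missing key; the names state_abbrevs and lookup are illustrative only.

```python
state_abbrevs = {"California": "CA", "New York": "NY"}   # small stand-in dict

def lookup(state_name):
    """Return the abbreviation, trapping the exception instead of testing first."""
    try:
        # only the line expected to raise goes inside the try: block
        abbrev = state_abbrevs[state_name]
    except KeyError:
        return f'no state found with name "{state_name}"'
    return abbrev

print(lookup("California"))
print(lookup("Oman"))
```

Keeping the try: block to a single line means an unrelated error elsewhere in the program cannot be mistaken for a missing key.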

 
5.5 Ranking. Reading cities_green_space.csv, build a dictionary that pairs city name keys with "pct" float values.
{'Amsterdam': 13.0, 'Austin': 10.0, 'Barcelona': 28.0, 'Bogotá': 4.9,
 'Brussels': 18.8, 'Buenos Aires': 9.4, 'Cape Town': 24.0, 'Chengdu': 42.3,
 'Dublin': 26.0, 'Edinburgh': 49.2, 'Guangzhou': 19.78, 'Helsinki': 40.0,
 'Hong Kong': 40.0, 'Istanbul': 2.2, 'Johannesburg': 24.0, 'Lisbon': 18.0,
 'London': 33.0, 'Los Angeles': 34.7, 'Melbourne': 9.3, 'Milan': 13.74,
 'Montréal': 12.82, 'Moscow': 18.0, 'Nanjing': 40.67, 'New York': 27.0,
 'Oslo': 68.0, 'Paris': 10.0, 'Rome': 38.9, 'San Francisco': 13.0,
 'Seoul': 27.91, 'Shanghai': 16.2, 'Shenzhen': 40.9, 'Singapore': 47.0,
 'Stockholm': 40.0, 'Sydney': 46.0, 'Taipei': 6.56, 'Tokyo': 7.5,
 'Toronto': 13.0, 'Vienna': 50.0, 'Warsaw': 17.0, 'Zürich': 41.0}

Next, print each city name and its pct value, ordered by pct value from highest to lowest.

Expected Output:
Cities Ranked by Greenspace (% of total area)

Oslo 68.0
Vienna 50.0
Edinburgh 49.2
Singapore 47.0
Sydney 46.0
Chengdu 42.3
Zürich 41.0
Shenzhen 40.9
Nanjing 40.67
Helsinki 40.0
Hong Kong 40.0
Stockholm 40.0
Rome 38.9
Los Angeles 34.7
London 33.0
Barcelona 28.0
Seoul 27.91
New York 27.0
Dublin 26.0
Cape Town 24.0
Johannesburg 24.0
Guangzhou 19.78
Brussels 18.8
Lisbon 18.0
Moscow 18.0
Warsaw 17.0
Shanghai 16.2
Milan 13.74
Amsterdam 13.0
San Francisco 13.0
Toronto 13.0
Montréal 12.82
Austin 10.0
Paris 10.0
Buenos Aires 9.4
Melbourne 9.3
Tokyo 7.5
Taipei 6.56
Bogotá 4.9
Istanbul 2.2


Overview. This is another common use of a dictionary: a ranking that orders keys by a paired value. We will:

  • load the dictionary with the data from the file, with the city name (first field of cities_green_space.csv) as the dictionary key, and the Pct values (second field) as the dictionary value
  • sort the dict keys by value, producing a list
  • loop through the sorted list of keys, printing each key and then looking up and printing the value for that key in the dict


Build the dict: read through the file line-by-line with for, and for each line in the file, split the line into a list of six strings, or assign to six variables using multi-target assignment; .rstrip() or slice the Pct value to remove the trailing '%' sign, then convert to float; and then add the city name and float value to the dict as a key/value pair using subscript assignment syntax (see "add a key/value pair to a dictionary" in the slides for correct syntax).
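
The sketch below illustrates the per-line work described above, under stated assumptions: the two sample lines are hypothetical stand-ins for cities_green_space.csv (city name first, Pct second, the remaining four fields filled with placeholders), and parse_pct is an illustrative helper name.

```python
# Hypothetical stand-in lines: six comma-separated fields, with the city
# name first and the Pct value second; the other fields are placeholders.
sample_lines = [
    "Oslo,68%,x,x,x,x\n",
    "Guangzhou,19.78%,x,x,x,x\n",
]

def parse_pct(text):
    """Strip the trailing '%' sign and convert what remains to float."""
    return float(text.rstrip("%"))

green_pct = {}
for line in sample_lines:
    fields = line.rstrip("\n").split(",")     # list of six strings
    city = fields[0]
    green_pct[city] = parse_pct(fields[1])    # subscript assignment adds the pair

print(green_pct)
```

Converting to float matters because string values sort character by character, so '9.4' would rank above '68'.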

Note that if Python can't find your file, it may be because the relative path is incorrect. Please see "Filepaths for Locating Files" in Session 4 Slides.

Test this step: your dict should have a len() of 40 and when printed should show city names and percentage values; make sure the values are floats or they won't sort properly:

{'Amsterdam': 13.0, 'Austin': 10.0, 'Barcelona': 28.0, 'Bogotá': 4.9,
 'Brussels': 18.8, 'Buenos Aires': 9.4, 'Cape Town': 24.0, 'Chengdu': 42.3,
 'Dublin': 26.0, 'Edinburgh': 49.2, 'Guangzhou': 19.78, 'Helsinki': 40.0,
 'Hong Kong': 40.0, 'Istanbul': 2.2, 'Johannesburg': 24.0, 'Lisbon': 18.0,
 'London': 33.0, 'Los Angeles': 34.7, 'Melbourne': 9.3, 'Milan': 13.74,
 'Montréal': 12.82, 'Moscow': 18.0, 'Nanjing': 40.67, 'New York': 27.0,
 'Oslo': 68.0, 'Paris': 10.0, 'Rome': 38.9, 'San Francisco': 13.0,
 'Seoul': 27.91, 'Shanghai': 16.2, 'Shenzhen': 40.9, 'Singapore': 47.0,
 'Stockholm': 40.0, 'Sydney': 46.0, 'Taipei': 6.56, 'Tokyo': 7.5,
 'Toronto': 13.0, 'Vienna': 50.0, 'Warsaw': 17.0, 'Zürich': 41.0}

Sort the dict keys. Call sorted() passing the dict as argument, including the special key= argument for sorting a dict by value and reverse=True. This should return a list of dict keys sorted by value. Test this step: your sorted list should show the cities ordered by the pct value, high to low:

['Oslo', 'Vienna', 'Edinburgh', 'Singapore', 'Sydney', 'Chengdu', 'Zürich',
 'Shenzhen', 'Nanjing', 'Helsinki', 'Hong Kong', 'Stockholm', 'Rome',
 'Los Angeles', 'London', 'Barcelona', 'Seoul', 'New York', 'Dublin',
 'Cape Town', 'Johannesburg', 'Guangzhou', 'Brussels', 'Lisbon', 'Moscow',
 'Warsaw', 'Shanghai', 'Milan', 'Amsterdam', 'San Francisco', 'Toronto',
 'Montréal', 'Austin', 'Paris', 'Buenos Aires', 'Melbourne', 'Tokyo',
 'Taipei', 'Bogotá', 'Istanbul']

Loop through the sorted list of keys and print the keys and values. Look for an example of this in the in-class exercises or the slides. Although you are looping through the list, you can use each of the strings in the list to obtain the value for that string, by subscripting the dict with the string. Your output should match that shown in the homework assignment.
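
The sort-then-loop step can be sketched as follows, using four pairs taken from the dict shown earlier. The variable names green_pct and ranked are illustrative; key=green_pct.get is the dict .get method used as the key= argument.

```python
# Four pairs copied from the full dict shown above
green_pct = {"Amsterdam": 13.0, "Edinburgh": 49.2, "Oslo": 68.0, "Vienna": 50.0}

# sorted() on a dict returns a list of its keys; key=green_pct.get tells it
# to order those keys by their values, and reverse=True puts the highest first
ranked = sorted(green_pct, key=green_pct.get, reverse=True)

for city in ranked:
    # subscript the dict with each key to obtain its value
    print(city, green_pct[city])
```

Python's sort is stable, so cities with equal values keep the order they had in the dict.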

 
5.6 (Extra credit.) Summing dictionary. Reading FF_abbreviated.txt, build a dictionary that sums all of the Mkt-RF values (the 2nd value, or leftmost float value) associated with each year. Sort the dictionary's keys by value and print each key and corresponding value, so that the values sort ascending. Do not loop through the data more than once.
Sample program run:
1926:  3.39
1928:  3.88
1927:  4.67

Discussion. This project builds a "summing dictionary" which calculates a separate sum of Mkt-RF values (the 2nd value, or leftmost float value) for each year in the Fama-French file. We call this kind of grouping of values under a unique key (i.e., each year) an "aggregation". In this case, the year will be the unique key in the dict, and the value for that key will be a sum of all Mkt-RF values (first float column) found for that year. An aggregation is a powerful and very common analysis technique. In aggregations in other data sets we might sum up total revenue by client, count the number of births by city, calculate average salary by gender, etc.

Step-by-Step. We will first load the dictionary with the sum of the Mkt-RF values for each year, and then sort and display the results.

1. Build the dict

  • Read from FF_abbreviated.txt line-by-line
  • For each line, slice out the 4-digit year and split out and convert to float the 1st float value
  • If the year is not a key in the dictionary, set the year as key and 0 as value in the dict
  • Add the float value (split out and converted as noted above) to the value currently associated with this year key


Similar to the earlier assignment (and other container-building assignments), we will be looping through the source file line-by-line, slicing out the 4-digit year and splitting out the Mkt-RF value (the 2nd value, or leftmost float value) from each line.

For example, if we had just these lines in the file (note the years in each line):
19260701    1.0    0.22    0.30   0.009
19260702    2.0    0.35    0.08   0.009
19270103   10.0    0.21    0.24   0.010
19270104   20.0    0.15    0.73   0.010
Then by the end, the dict would have these keys and values:
{'1926': 3.0, '1927': 30.0}

To understand this result, note that the 1926 floats total 3.0 and the 1927 floats total 30.0. So you can see that the dictionary can be used to sum up values for each year. We use the key to indicate which total we're summing, and the value for that key to hold the sum.

How can a dictionary be used to sum up values under a particular key? A summing dict checks to see if the year is in the dict. If it is not, it adds the year and float value from the line to the dict. But if it is already in the dict, it merely adds the float value from the line to the value for the year that is already in the dict.

Here is a line-by-line breakdown of this concept, again considering the 4-line file above:

When the for loop reads the first line:
19260701    1.0    0.22    0.30   0.009

The program checks to see if '1926' is a key in the dict, using the in operator (if year not in dict). Since it is not (the dict is currently empty) the program adds the key '1926' and value 1.0 to the dictionary.

At the end of the first iteration, then, the dict will contain:
{ '1926': 1.0 }
When the for loop reads the second line:
19260702    2.0    0.35    0.08   0.009

The program checks to see if '1926' is a key in the dict. Since it is already in the dict (we added it in the previous iteration) we add 2.0 to 1.0 to make 3.0. This value 3.0 is then associated with the same year key. In other words, we're replacing the original value for '1926' (1.0) with a new summed value (3.0).

The operative code line for adding a new value to an existing value in the dict is this:
sumdict[year] = sumdict[year] + mkt_rf   # here, mkt_rf is the
                                         # current Mkt-RF value

Basically, the above says: "let the value for 1926 in the dict be associated with the current value for 1926 (1.0) plus the current value from this line (2.0)". So, as you loop, you will need to check the dict ahead of time to see if the year key is already there. If not, set the key and value in the dict for that line.

if year not in sumdict:
    sumdict[year] = mkt_rf                   # first value seen for this year
else:
    sumdict[year] = sumdict[year] + mkt_rf   # here, mkt_rf is the
                                             # current Mkt-RF value

In the next iteration, the year is 1927. This key is not in the dict, so we will add it to the dict with the current float value (10.0). The same process happens for the two lines with 1927 as they did for 1926.
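
Putting the pieces together, here is a small sketch that runs the summing logic over the four sample lines shown earlier. A list of strings stands in for the open file, which is an assumption made so the example runs on its own.

```python
# The four sample lines from the discussion above, standing in for the file
sample_lines = [
    "19260701    1.0    0.22    0.30   0.009",
    "19260702    2.0    0.35    0.08   0.009",
    "19270103   10.0    0.21    0.24   0.010",
    "19270104   20.0    0.15    0.73   0.010",
]

sumdict = {}
for line in sample_lines:
    year = line[0:4]                    # slice out the 4-digit year
    mkt_rf = float(line.split()[1])     # 2nd value on the line (leftmost float)
    if year not in sumdict:
        sumdict[year] = mkt_rf          # first value seen for this year
    else:
        sumdict[year] = sumdict[year] + mkt_rf   # add to the running sum

print(sumdict)   # {'1926': 3.0, '1927': 30.0}
```

The data is read only once: each line updates the running total for its own year and nothing else.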

When built from the full FF_abbreviated.txt file, the (still unsorted) dict will look like this:
{'1926': 3.39, '1927': 4.67, '1928': 3.88}


2. Sort the dict keys; loop through and display key and value

  • When the loop is done, sort the dict keys by value (use the dict .get method as the key= argument) to generate a list of keys
  • Loop through the list of sorted keys and print out the year and value as shown in the sample run

Use sorted() to sort the dict by value by using the key=mydict.get argument (where mydict is the name of the dict). As you know, when we sort a dict with sorted() we get back a list of sorted keys. If you loop through the sorted list of keys, you'll see the years printed, sorted by value:

1926
1928
1927

To print the year keys and summed values, loop through the sorted list of years and print the year; and then use that year to get the value for that key from the dictionary.
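
A short sketch of this last step follows, starting from the built dict shown earlier. The output_lines list is an illustrative addition that collects the formatted lines; the two spaces after the colon follow the sample run.

```python
# The built dict from the discussion above
sumdict = {"1926": 3.39, "1927": 4.67, "1928": 3.88}

output_lines = []
# key=sumdict.get orders the year keys by their summed values, ascending
for year in sorted(sumdict, key=sumdict.get):
    output_lines.append(f"{year}:  {sumdict[year]}")   # look up the sum by key

for text in output_lines:
    print(text)
```

Sums computed from the real file may carry small floating-point remainders, so rounding the value when printing may be needed to match the sample run.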

 