Introduction to Python

davidbpython.com




Projects, Session 4



PLEASE REMEMBER:

  1. re-read the assignment before submitting
  2. go through the checklist including the tests
  3. make sure your notations are as specified in the homework instructions

All requirements are detailed in the homework instructions document.

Careless omissions will result in reductions to your solution grade.

 

NOTE ON OPENING FILES

This week we learned about absolute and relative paths. To find a file using a relative path, we must know a) the location of the file, and b) the location from which we are running our script (the "present working directory", or pwd), in order to determine c) the path needed to reach the file location from the pwd. If Python can't find your file, it may be because the relative path is incorrect.

If the file you want to open is in the same directory as the script you're executing, use the filename alone:
fh = open('filename.txt')
If the file you want to open is in the parent directory from the script you're executing, use the filename with ../:
fh = open('../filename.txt')
If the file you want to open is in a child directory from the script you're executing, use the filename with the child directory name prepended:
fh = open('<childdir>/filename.txt')

(Replace <childdir> with the name of the child directory.)
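The three cases above can be tried out safely with a small sketch. It builds a throwaway directory tree with the standard tempfile module, so it does not touch any course files. The directory and file names (work, child, same.txt, parent.txt, below.txt) are invented for illustration only.

```python
import os
import tempfile

# Build a small throwaway tree (all names here are hypothetical):
#   <tmp>/parent.txt
#   <tmp>/work/same.txt
#   <tmp>/work/child/below.txt
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'work', 'child'))
for relpath in ['parent.txt', 'work/same.txt', 'work/child/below.txt']:
    with open(os.path.join(root, relpath), 'w') as fh:
        fh.write(relpath + '\n')

original_dir = os.getcwd()
os.chdir(os.path.join(root, 'work'))      # pretend the script is run from "work"

found = []
for path in ['same.txt', '../parent.txt', 'child/below.txt']:
    with open(path) as fh:                # each path is resolved from the pwd
        found.append(fh.read().strip())

os.chdir(original_dir)                    # restore the original pwd
print(found)
```

Each open() succeeds only because the path is written from the point of view of the directory the code is running in. Changing the os.chdir() target without changing the paths would raise FileNotFoundError.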

 
4.1 Notes typing assignment. Please write out this week's transcription notes. The notes are displayed as an image named transcription in each week's project files folder.

This does not need to be in a Python program - you can use a simple text file.

 
4.2 Filepaths Exercises.

As usual, returned solutions will lose points. It is recommended to confirm (through testing) that your answer is correct before submitting to ensure that you will receive credit.

Start with the below file tree, which is available in this week's data folder:
dir1
├── file1.txt
├── test1.py
│
├── dir2a
│   ├── file2a.txt
│   ├── test2a.py
│   │
│   └── dir3a
│       ├── file3a.txt
│       ├── test3a.py
│       │
│       └── dir4
│           ├── file4.txt
│           └── test4.py
└── dir2b
    ├── file2b.txt
    ├── test2b.py
    │
    └── dir3b
       ├── file3b.txt
       └── test3b.py

To complete this assignment, please open and edit each of the below 5 .py scripts in the tree so that they open the noted .txt files (please do not move, copy or recreate any of the files -- they must be modified and run where they are located in the tree):

  1. test4.py: open and read file4.txt
  2. test2b.py: open and read file3b.txt
  3. test1.py: open and read file3a.txt
  4. test2a.py: open and read file1.txt
  5. test3a.py: also open and read file1.txt

Your job is to fill in the relative filepath (i.e. not starting with C:\Users or /Users) needed to open the indicated file in the open(r'') function call in each script.
Test (i.e., run) each of the above scripts to verify that they open and read the indicated file.
Finally, copy out the paths that you used to open each file as indicated below. (The first has been done for you.)
Keep in mind, this is not a program to be run! Simply fill in the open() filepaths you used to read each .txt file from the indicated .py file.

######## test4.py:  read file4.txt ########

fh = open('file4.txt')   # this path has been completed for you


######## test2b.py:  read file3b.txt ########

fh = open('')              # add relative filepath here to open file3b.txt


######## test1.py:  read file3a.txt ########

fh = open('')              # add relative filepath here to open file3a.txt


######## test2a.py:  read file1.txt ########

fh = open('')              # add relative filepath here to open file1.txt


######## test3a.py:  read file1.txt ########


fh = open('')              # add relative filepath here to open file1.txt

See the "Filepaths for Locating Files" slide deck for this session for a discussion that will help you complete this assignment. Send me any questions you may have.

 
4.3 Take user input for a 4-digit year and exit if incorrect.

The program can use an if/else: call exit() if the value is bad (not 4 characters or not all digits), and if it is good, print 'input validated' and the 4-digit input. Use a compound test: if not 4 characters or not all digits, then exit. SPECIAL NOTE: Please do not use a while True: loop. Here let's work with if/else.

Sample program run:
please enter a 4-digit year: 234a
sorry, must be 4 digits      [ program exits here ]
Sample program run:
please enter a 4-digit year: 234
sorry, must be 4 digits      [ program exits here ]
Sample program run:
please enter a 4-digit year: 2349
input validated:  2349
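As a rough sketch of the compound test only, the check can be written as a single 'or' expression. The helper name is_bad_year is hypothetical, and the sketch loops over sample strings instead of calling input() and exit(), so it is not a complete solution.

```python
def is_bad_year(text):
    """Return True if text is not exactly 4 characters or is not all digits."""
    return len(text) != 4 or not text.isdigit()

# Try the three sample inputs shown above.
for sample in ['234a', '234', '2349']:
    if is_bad_year(sample):
        print(f'{sample}: sorry, must be 4 digits')
    else:
        print(f'{sample}: input validated')
```

Because the two conditions are joined with 'or', a single 'if' covers both kinds of bad input and no nested 'if' is needed.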

Note that even if the year isn't a real year, the program still validates it - the test is len() of 4, and all digits.

HOMEWORK CHECKLIST: all points are required


    Since the program uses exit() to terminate if the user's input is incorrect, this program does not use a while True block. while True is for repeating the same action multiple times. If the program doesn't do that, it should not be using while True.

    testing: you have run the program with the sample inputs as shown and are seeing the output exactly as shown (contact me if your output is different and you're unable to adjust to match)

    when testing for bad input (must be 4-characters long and must be all digits), program uses a compound test with 'or' (if string is not length of 4 or string is not all digits) rather than using an 'if' block nested inside another 'if' block (nested 'if' blocks are inherently more complex logic)

    program does not compare the user's year to 1000 or 9999 -- it uses the len() of the user year to determine if it is 4 characters

    there are no extraneous comments or "testing" code lines

    program follows all recommendations in the "Code Quality" handout

 
4.4 Compile a list of Mkt-RF float values (the 2nd column, or leftmost float values) for a given year from the file FF_tiny.txt.
Start with a 4-digit string year (for example, '1927') assigned to a string variable:
year = '1927'
Initialize an empty "collector" list.
mktrf_list = []

Then, looping through the FF_tiny.txt file, collect a list of Mkt-RF values (the 2nd column, or leftmost float values) for that year. (Note that the output below is from FF_tiny.txt.)

Sample run (with year == '1927')
[0.97, 0.3, 0.0, 0.72]
Sample run (with year == '1926')
[0.09, 0.44]
Sample run (with year == '1928')
[0.43, 0.14, 0.71]

Next, please set the year to '9999' and make sure that the program prints out an empty list.

Sample run (with year = '9999')
[]
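The collector pattern described above might look roughly like the sketch below. It works on a short list of invented lines rather than the real file. The line layout (an 8-digit date followed by whitespace-separated floats), the sample values, and the helper name collect_second_column are all assumptions made for illustration.

```python
# Invented stand-in lines; these are NOT the contents of FF_tiny.txt.
sample_lines = [
    '20010102    1.50    0.10    0.20   0.010',
    '20010103   -0.25    0.30    0.40   0.010',
    '20020102    0.75    0.50    0.60   0.020',
]

def collect_second_column(lines, year):
    """Collect the 2nd-column float from each line whose first 4 characters match year."""
    values = []                              # initialized once, before the loop
    for line in lines:
        if line[0:4] == year:                # slice the 4-digit year from the line
            fields = line.split()            # split the line on whitespace
            values.append(float(fields[1]))  # convert the string to a float
    return values

print(collect_second_column(sample_lines, '2001'))
print(collect_second_column(sample_lines, '9999'))
```

With a real file, the open file object could be looped over in place of sample_lines. A year that matches no line leaves the list empty, which is the case the '9999' test is meant to confirm.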

When we combine the programs later, this empty list will be the key to determining if the user may have input a year that doesn't exist in the data.

Note that if Python can't find your file, it may be because the relative path is incorrect. Please see "Filepaths for Locating Files" in the Session 4 Slides.

HOMEWORK CHECKLIST: all points are required


    testing: you have run the program with the sample inputs as shown and are seeing the output exactly as shown (contact me if your output is different and you're unable to adjust to match)

    the program uses .split() to get the float value from the line, not slice (which should be used to get the 4-digit year from the line)

    the program produces a list of floats, not a list of strings

    actions that need to be done repetitively are located inside the loop - for example, appending each value to the list

    actions that need to be done only once are located outside the loop - for example, initializing the empty list or printing the list

    the code is not including any lines or variables that aren't used for the purposes of this program

    there are no extraneous comments or "testing" code lines

    program follows all recommendations in the "Code Quality" handout

 
4.5 Calculate sum, count, max, min, average and (optionally) median from a list of values.

Note: this assignment does not read from the file, loop, or gather values - it only uses the hard-coded values shown below. We will later combine this program with the data gathering solution to create a complete program.

Start with this sample list of values from 1927:
user_year = '1927'
mktrf_list = [0.97, 0.30, 0.00, 0.72]

Use the "summary functions" we looked at this week to calculate the sum, count, max, min and average. Round the average to 2 decimal places. Print the year along with the results that you calculated using the summary functions:

Sample program run:
1927 (Mkt-RF):  4 values, max 0.97, min 0.0, avg 0.5

Next, please test your program with a different list of values: there is no need to write out the logic again or create a new program - simply change the below two variables and test the code that you wrote for the above.

Reassign the below two variables with values from 1928:
user_year = '1928'
mktrf_list = [0.43, 0.14, 0.71]

Print the year along with the results that you calculated using the summary functions:

Sample program run:
1928 (Mkt-RF):  3 values, max 0.71, min 0.14, avg 0.43
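One possible shape for the summary-function calculation is sketched below, using the 1927 values given above. The intermediate names count, average and report are assumptions; only built-in functions are used and no loop is involved.

```python
user_year = '1927'
mktrf_list = [0.97, 0.30, 0.00, 0.72]

count = len(mktrf_list)                        # number of values
average = round(sum(mktrf_list) / count, 2)    # mean, rounded to 2 decimal places

report = (f'{user_year} (Mkt-RF):  {count} values, '
          f'max {max(mktrf_list)}, min {min(mktrf_list)}, avg {average}')
print(report)
```

Reassigning user_year and mktrf_list to the 1928 values and re-running the same lines is enough to test the second case. None of the calculation lines need to change.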

Extra credit / supplementary: calculate the median.

  • calculate the "median" or "middle" value in a sorted list
  • in an odd-numbered list of values (3, 5, etc.), the median is the one in the middle of the sorted list
  • in an even-numbered list of values, the median is the average of the two middle values in a sorted list

  • mktrf_1927 = [0.97, 0.30, 0.00, 0.72]  # median is 0.51
                                           # (average of 0.30 and 0.72)
    
    mktrf_1928 = [0.43, 0.14, 0.71]        # median is 0.43 (middle value in sorted list)
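A minimal sketch of computing the middle index rather than hard-coding it is shown below. The function name median and the rounding to 2 decimal places are assumptions, and the sketch assumes the list is not empty.

```python
def median(values):
    """Return the median of a non-empty list, computing the middle index from its length."""
    ordered = sorted(values)
    middle = len(ordered) // 2                 # integer division gives the middle index
    if len(ordered) % 2 == 1:                  # odd count: a single middle value
        return ordered[middle]
    # even count: average the two values on either side of the midpoint
    return round((ordered[middle - 1] + ordered[middle]) / 2, 2)

print(median([0.97, 0.30, 0.00, 0.72]))
print(median([0.43, 0.14, 0.71]))
```

The index comes from len() of the sorted list, so the same lines work for a list of any length.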
    

    You must not hard-code the index of the middle value(s)! This index must be calculated.

    HOMEWORK CHECKLIST: all points are required


        testing: you have run the program with the sample inputs as shown and are seeing the output exactly as shown (contact me if your output is different and you're unable to adjust to match)

        the code doesn't loop over the list of values - only the 'summary functions' are needed for this calculation (this is part of the purpose of the project - using the summary functions to calculate a count, sum and average)

        if doing the "median" extra credit, program calculates the index position of the median value(s), it does not hard-code this index

        the code is not including any lines or variables that aren't used for the purposes of this program

        there are no extraneous comments or "testing" code lines

        program follows all recommendations in the "Code Quality" handout

     
    4.6 PASTE IN THE CODE, ONE AFTER THE OTHER for the previous three assignments into one complete program, and alter variable names so that they work together. (You'll also remove the hard-coded year from the 'for' loop section, and the hard-coded list of values from the average calculation section.)

    As we did last week, please paste in the 3 above solutions into a new program. Please do not combine or "nest" any code, as the solutions can work together without being mixed together -- they should follow one another. Remove the hard-coded values (as well as the 'input validated' message) and adjust the names of the variables so that the 3 code blocks can work together. For example, the year value taken in the user input section should be the same variable name as the year used in the data gathering section. This time, use the FF_data.txt file. Ensure that the program works as shown in the examples below:

    Sample program run:
    please enter a 4-digit year: 196
    sorry, must be 4 digits
    Sample program run:
    please enter a 4-digit year: help
    sorry, must be 4 digits
    Sample program run:
    please enter a 4-digit year: 1926
    1926 (Mkt-RF):  150 values, max 1.48, min -1.69, avg 0.05
    Sample program run:
    please enter a 4-digit year: 1972
    1972 (Mkt-RF):  251 values, max 1.38, min -1.45, avg 0.05
    Sample program run:
    please enter a 4-digit year: 1993
    1993 (Mkt-RF):  253 values, max 1.56, min -2.7, avg 0.03

    year not found in data (please note special approach to this error)

    Sample program run:
    please enter a 4-digit year:  9999
    no values found

    Special note on "no values found"

    The one original addition in this "combined" solution is the test that will announce "no values found for year YYYY" (where YYYY is the user's year). You will determine this by testing the length of the collector list. If after going through the entire file no year matched, the list will be empty. Test to see if the length of the list is 0 -- if so, print the message and exit.

    Special note on input testing with if/else

    Some students (rather intuitively, I think) employ this logic:
    take input
    
    if input is 4 characters and all digits:
        print('input validated')
    
        # read file and calculate results (whole rest of program)
    
    else:
        exit('input bad')

    However, you are requested NOT TO PUT YOUR ENTIRE PROGRAM IN AN if OR else BLOCK. The reason for this has to do with our desire to separate steps in our code so that they are independent. If you put the whole program in the if you are creating a dependent relationship that is not needed. You are also pushing the else all the way to the bottom where it is hard to see the connection to the if.

    You can resolve this issue by handling the input validation logic first, then moving on.
    take input
    
    if input is 4 characters and all digits:
        print('input validated')
    
    else:
        exit('input bad')
    
    # read file and calculate results (whole rest of program)
    Or even shorter and cleaner (and thus much better), use a negative test:
    take input
    
    if input is not 4 characters or input is not all digits:
        exit('input bad')
    
    # read file and calculate results (whole rest of program)

    No 'else' is required using this logic. Less is more!

    HOMEWORK CHECKLIST: all points are required


        testing: you have run the program with the sample inputs as shown and are seeing the output exactly as shown (contact me if your output is different and you're unable to adjust to match)

        the code solutions are pasted in one after the other -- they are not nested, mixed or combined

        when testing for bad input (must be 4-characters long and must be all digits, or input is "bad"), program uses a compound test with 'or' (if this is true or that is true) rather than using an 'if' block nested inside another 'if' block (nested 'if' blocks are inherently more complex logic)

        if the user's input is invalid (not 4 digits) the program calls exit() (unless you chose to retain a while True loop, in which case it would continue)

        program does not compare the user's year to 999, 1000, 9999 or 10000 -- it uses the len() of the user year to determine if it is 4 digits

        to determine whether the year is correct, program loops through the whole file first, then looks at the collected list to see if any items are there. It does not loop through the file more than once.

        the code doesn't use integers or floats to calculate the count or sum - only the list is needed for this calculation (this is part of the purpose of the project - using a list of values to calculate a count, sum and average)

        actions that need to be done repetitively are located inside the loop - for example, appending each value to the list

        actions that need to be done only once are located outside the loop - for example, calculating the sum, count and average from the list

        the code is not including any lines or variables that aren't used for the purposes of this program

        there are no extraneous comments or "testing" code lines

        program follows all recommendations in the "Code Quality" handout

     
    4.7 Show unique years in FF_data.txt:

    Reading from the dates in the left-hand column of FF_data.txt, compile a sorted list of unique 4-digit years. Use a set() to build the collection of years, then use sorted() to sort them into a list. Print the list and the number of years found in the list.

    To initialize an empty set, you must use the set() function:
    unique_years = set()
    

    Empty curly braces would signify a dict.
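The difference can be checked directly, and the same sketch shows a set discarding duplicates before sorted() turns it into a list. The date strings are invented for illustration and are not taken from FF_data.txt.

```python
print(type({}))       # empty curly braces create a dict
print(type(set()))    # set() creates an empty set

unique_years = set()
for date in ['19260701', '19260702', '19270103', '19260706']:   # invented dates
    unique_years.add(date[0:4])       # adding a duplicate has no effect

sorted_years = sorted(unique_years)   # sorted() returns a list
print(sorted_years)
print(f'{len(sorted_years)} unique years found')
```

Sorting happens once, after the loop, because the set itself has no guaranteed order.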

    Expected Output (please note this is abbreviated - we should see all years in the list below)
    ['1926', '1927', '1928' ...  '2010', '2011', '2012']
    87 unique years found

    HOMEWORK CHECKLIST: all points are required


        testing: you have run the program with the sample inputs as shown and are seeing the output exactly as shown (contact me if your output is different and you're unable to adjust to match)

        actions that need to be done repetitively are located inside the loop - for example, adding each value to the set

        actions that need to be done only once are located outside the loop - for example, sorting the set or calculating the len of the list

        the code is not including any lines or variables that aren't used for the purposes of this program

        there are no extraneous comments or "testing" code lines

        program follows all recommendations in the "Code Quality" handout

     

    EXTRA CREDIT / SUPPLEMENTARY EXERCISES

     
    4.8 (Extra credit / supplementary.) Repeating and enhancing the float-collecting solution (the main program for this week), complete one or both of the following:

    Enhancement A: use the full FF_Research_Data_Factors_daily.txt file as data source.

    • use FF_Research_Data_Factors_daily.txt as data source (this file contains header lines and footer lines)
    • use "file slicing" (slicing a list of lines from the file) to isolate the data lines - do not use a counter, do not use a very large number (you can use a negative subscript in a slice) and do not test the line to see if it starts with all digits
    • loop through the lines and split out the Mkt-RF values (the 2nd column, or leftmost float values) from lines that match the selected year, and collect them in a list
    • once data looping is complete, compute the count, sum, average, max, min and (optionally) median, as in the for-credit assignment
    • pay close attention to 1926 and 2018 and make sure your calculations match (see below)
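The "file slicing" idea can be sketched on a small invented list of lines. The header and footer counts below are hypothetical and would need to be counted in the real file. With an open file, the list could come from readlines().

```python
# Invented stand-in for the list of lines read from a file with headers and footers.
file_lines = [
    'Title line of the file',
    '',
    '          Mkt-RF     SMB     HML      RF',
    '20010102    1.50    0.10    0.20   0.010',
    '20010103   -0.25    0.30    0.40   0.010',
    '20020102    0.75    0.50    0.60   0.020',
    '',
    'Footer line of the file',
]

header_count = 3     # hypothetical: number of lines before the data starts
footer_count = 2     # hypothetical: number of lines after the data ends

# A negative stop index counts back from the end of the list.
data_lines = file_lines[header_count:-footer_count]
print(data_lines)
```

The slice needs neither a counter nor a test on each line. If either count is off by one, a data line is lost or a non-data line is kept, which is the kind of error the troubleshooting note below describes.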

    Sample program runs:
    please enter a 4-digit year:  1926
    1926 (Mkt-RF):  150 values, max 1.53, min -1.83, avg 0.05
    please enter a 4-digit year:  2018
    2018 (Mkt-RF):  82 values, max 2.67, min -4.03, avg 0.0

    Troubleshooting: if your 1926 and 2018 values are off, it may be because you have sliced the wrong number of lines. Note that if Python can't find your file, it may be because the relative path is incorrect. Please see "Filepaths for Locating Files" in the Session 4 Slides.

    Enhancement B: choose column to sum.

  • user inputs year and factor (Mkt-RF, SMB, HML, RF)
  • program checks to see that factor is valid
  • program compiles a list from that column
  • make sure that you do not repeat code in your solution!

  • Sample program runs:
    enter a 4-digit year:  1900
    enter a factor to process:  Mkt-RF
    no values found
    enter a 4-digit year:  1972
    enter a factor to process:  Mkt-RF
    1972 (Mkt-RF):  251 values, avg 0.0486055776892
    enter a 4-digit year:  2001
    enter a factor to process:  SMB
    2001 (SMB):  248 values, avg [your calculated value here]
    enter a 4-digit year:  2001
    enter a factor to process:  XXL
    Sorry, that factor does not exist.

    HOWEVER, you must not repeat code in your solution! Instead, select an index based on factor location and use that index when selecting the value from the split list. (Note that the restriction not to repeat code does not mean that you can't repeat code from the for-credit solution -- it simply means to not repeat code within any one solution.)
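One way to select an index from the factor name, sketched under assumptions: the columns are taken to appear left to right as Mkt-RF, SMB, HML, RF after the date, and the sample line and the helper name factor_value are invented for illustration.

```python
factors = ['Mkt-RF', 'SMB', 'HML', 'RF']      # assumed left-to-right column order

def factor_value(line, factor):
    """Return the float in the column for factor, or None if the factor is unknown."""
    if factor not in factors:
        return None
    column_index = factors.index(factor) + 1   # +1 skips the date column
    return float(line.split()[column_index])

sample_line = '20010102    1.50    0.10    0.20   0.010'   # invented line
print(factor_value(sample_line, 'Mkt-RF'))
print(factor_value(sample_line, 'SMB'))
print(factor_value(sample_line, 'XXL'))
```

Because the index is looked up from the factor name, the same split-and-convert line serves every column and no block of code is repeated per factor.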

     
    4.9 (Extra credit / supplementary.) wc emulation: wc is a unix utility program that counts the number of lines, words and characters in a given file.

    Reading from file sawyer.txt (found in the source data directory linked from the class website), print the number of lines, words and characters in the file. Note: when counting characters, include spaces and newlines as well. Please do not use a 'for' loop and do not open the file more than once.

    please enter a filename:  sawyer.txt
    20 lines
    270 words
    1440 characters

    Special note: we have seen anomalies that stem from varying line endings on the Windows and Mac platforms. You may get a count that differs from mine by 1 or 2 lines, or by up to 20 characters. If so, don't worry. It just needs to be within 20.

    Challenge 1: please do not open and read the file more than once.

    Challenge 2: you will be tempted to loop and count to get your answer, but you are challenged to get these counts without looping. See if you can get each count without actually using a loop, but by using len() on various "slice and dices" of the data. Do this without opening the file more than once (hint: read() will read the file into a string; split() will split a string into words; splitlines() will split a string on the newlines!).

    Note that if Python can't find your file, it may be because the relative path is incorrect. Please see "Filepaths for Locating Files" in the Session 4 Slides.

    Extra credit / supplementary: attempt to mirror the format of the original wc unix utility by right-justifying each value within an 8-character width (you can use the str method .rjust(), or an f'' string with :>8 inside the token -- see f'' strings in the 'object methods' slides from Session 2).

    please enter a filename:  sawyer.txt
          20     270    1435 sawyer.txt
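A sketch of the no-loop counting approach follows, run on a short invented string in place of the text returned by read(). The sample text and the name sample.txt are assumptions for illustration.

```python
# Invented stand-in for the single string that read() would return.
text = 'the quick brown fox\njumps over\nthe lazy dog\n'

line_count = len(text.splitlines())   # split on newlines, then count the pieces
word_count = len(text.split())        # split on any whitespace, then count
char_count = len(text)                # every character, spaces and newlines included

# Right-justify each count within an 8-character width.
report = f'{line_count:>8}{word_count:>8}{char_count:>8} sample.txt'
print(report)
```

All three counts come from the same string, so the file would only need to be read once.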

    HOMEWORK CHECKLIST: all points are required


        testing: you have run the program with the sample inputs as shown and are seeing the output exactly as shown (contact me if your output is different and you're unable to adjust to match)

        the file is read only once

        the counts are obtained without the use of looping

        the program uses f'' strings to combine numbers with strings, or strings with other strings

        there are no extraneous comments or "testing" code lines

        program follows all recommendations in the "Code Quality" handout

     
    4.10 (Extra credit / supplementary.) Spell checker: check each word in a file against a set of spelling words.

  • load spelling words from words.txt into a set (make sure each word is stripped of newline before adding)
  • print the count of words in the set (count will be 25225, or 25185 if you lowercased each word before adding)
  • for loop through sawyer.txt line-by-line, counting each line as you loop
  • inside the for loop, split the line into a list of strings (words)
  • then while still inside the for loop, loop through the split list of words from the line (keep in mind: this loop is nested inside the file loop above)
  • "normalize" the word by lowercasing it and stripping it of punctuation
  • if any word in the line does not appear in word set, print word and line number
  • do not load the sawyer words into a list or set! The misspellings can be reported as they are found

  • Sample program run:
    25225 words in spelling words   (note this count will be 25185 if
                                     words are lowercased before adding)
    
    misspelled word on line 1:  russling
    misspelled word on line 4:  unconshiously
    misspelled word on line 6:  interlarded
    misspelled word on line 8:  minst
    misspelled word on line 14:  coattails
    misspelled word on line 16:  hhe
    misspelled word on line 18:  sentense
    misspelled word on line 20:  akt
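The normalizing step and the nested loop might be sketched as below, using a tiny invented word set and two invented lines in place of words.txt and sawyer.txt. The helper name normalize is hypothetical, and string.punctuation is one possible choice of characters to strip.

```python
import string

spelling_words = {'the', 'cat', 'sat', 'on', 'mat'}   # invented tiny word set
sample_lines = ['The cat sat,', 'on teh mat.']        # invented text lines

def normalize(word):
    """Lowercase the word and strip punctuation from both ends."""
    return word.lower().strip(string.punctuation)

line_number = 0
for line in sample_lines:                 # outer loop: one line at a time
    line_number += 1
    for word in line.split():             # inner loop: each word in this line
        if normalize(word) not in spelling_words:
            print(f'misspelled word on line {line_number}:  {normalize(word)}')
```

Each misspelling is reported as soon as it is found, so the text being checked is never stored in a list or set. Note that strip() only removes punctuation from the ends of a word, so punctuation inside a word is left in place.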

    The spelling words should be loaded into a set, but the sawyer.txt words should not be loaded into a container (list or set) -- we should simply loop through the file line-by-line, and then for each line inside the loop, split the line into words and, still inside the for loop, loop through each word word-by-word. We can simply report the misspelled word; we do not need to add the word to another list or other structure.

    Note that if Python can't find your file, it may be because the relative path is incorrect. Please see "Filepaths for Locating Files" in the Session 4 Slides.

    HOMEWORK CHECKLIST: all points are required


        spelling words are loaded into a set, not a list -- a set is much more efficient for lookups

        in reading sawyer.txt, the program employs a loop within a loop - a 'for' loop to loop through the file line-by-line, then an "inner" 'for' loop to loop through each word in the line

        sawyer.txt is not loaded into a list or set, since this would destroy line number information. Instead, the file is read line-by-line, each line is split into words, and then another loop (inside the 'for' loop) loops through each word in the line

        there are no extraneous comments or "testing" code lines

        program follows all recommendations in the "Code Quality" handout

     