Introduction to Python

davidbpython.com




Project Discussion, Session 4



4.1 Notes typing assignment. Please write out this week's transcription notes.
 
4.2 Filepaths Exercises.

Notations are not required for this solution. First, you must not use an absolute path (like those that start with a C:\ or other drive letter, or those that start with a slash (/). These can be used to correctly locate a file, but we are here to learn how to construct relative paths). A relative path is one that locates a file relative to the location of the script. Therefore, we must first consider the location of the script within the filepath. There are four relative locations to consider:

  1. file in same directory as script: simply use the filename, no path information
  2. file in directory "above" script, i.e. nested inside the script's directory (for example, dir2a is one level above dir1): precede the filename with the directory name where the file is located and a slash, i.e. dir2a/. If the directory is multiple levels above, "chain" them into a path (for example, to get to a file in dir3a from dir1, the path is dir2a/dir3a/)
  3. file in directory "below" script, i.e. a directory that contains the script's directory (for example, dir1 is one level below dir2a): precede the filename with the "parent" directory shortcut and a slash, i.e. ../. If the directory is multiple levels down, "chain" them into a path (for example, to get to a file in dir1 from dir3a, the path is ../../)
  4. file in a "sibling" directory of the script's directory (for example, dir2a and dir2b are at the same directory level): use the .. to go "down" to the parent directory where the target directory is located, then add the target directory name (for example, to get to a file in dir2a from dir2b, the path is ../dir2a/)
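
The four cases above can be checked without touching the disk. The sketch below is only an illustration: it assumes the layout implied by the examples (dir2a and dir2b inside dir1, dir3a inside dir2a), uses a made-up filename file.txt, and a hypothetical helper named resolve(). It uses the posixpath module so that the slashes come out the same on every operating system. Keep in mind that open() resolves a relative path from the current working directory, which matches the script's location when the script is run from its own folder.

```python
import posixpath

def resolve(script_dir, relative_path):
    """Show where a relative path ends up when we start from script_dir."""
    return posixpath.normpath(posixpath.join(script_dir, relative_path))

# 1. same directory: filename only
same_dir = resolve('dir1', 'file.txt')

# 2. file nested two levels inside the script's directory: chain the names
nested = resolve('dir1', 'dir2a/dir3a/file.txt')

# 3. file two levels toward the parent: chain the ../ shortcut
parent = resolve('dir1/dir2a/dir3a', '../../file.txt')

# 4. sibling directory: go to the parent, then into the target directory
sibling = resolve('dir1/dir2b', '../dir2a/file.txt')

print(same_dir)
print(nested)
print(parent)
print(sibling)
```
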

Troubleshooting:

  • make sure all names are correctly spelled - they are very similar, so take extra care when reviewing
  • make sure you are including a slash between each folder and file, so ..dir1 should be ../dir1
  • make sure the path is relative, so it must not start with a slash or drive letter
  • check and recheck the relative location between the script you are running and the target file
  • if having difficulty reaching a file that is multiple levels away, you can try to "walk" your way to the file by opening a closer file on the path

 
4.3 Take user input for a 4-digit year and exit if incorrect.

Please use the input() function to take user input. Use an if/else statement with a compound test to see whether:

  • the input is not 4 characters long, OR
  • the input is not all digits

If either of the above is true, exit() with an appropriate error message. Otherwise, print 'input validated' and the value. (Alternatively, you could reverse the logic and say "if the argument is 4 characters AND is all digits, print 'input validated', else exit" -- either approach is fine.)

Note: a compound test is better than two 'if' statements here because we then have one test and one outcome. Please don't use a while True: loop; let's work with if/else this time.

Test this step: make sure the script shows an error with the following inputs:

19722
197a
197
abcd
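
One way to try the compound test against the inputs above without retyping them is sketched here. The helper name is_valid_year() is hypothetical and is used only so the test can be run on several values in a row; in the assignment itself the same test sits directly in an if/else that calls exit(). The last value, '1972', is an added example of an input that should pass.

```python
def is_valid_year(text):
    """Return True only if text is exactly 4 characters long and all digits."""
    # one compound test, one outcome
    if len(text) != 4 or not text.isdigit():
        return False
    else:
        return True

# the four bad inputs listed above, plus one good value for comparison
for candidate in ['19722', '197a', '197', 'abcd', '1972']:
    print(candidate, is_valid_year(candidate))
```
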
 
4.4 Compile a list of Mkt-RF float values (the 2nd column, or leftmost float values) for a given year from the file FF_tiny.txt

This is just like the file parser we did in the previous session, but instead of a float summer and integer counter, we are collecting float values in a list. You'll do the same work you did in that previous assignment.

Start with a 4-digit string year (for example, '1927') assigned to a string variable, and an empty "collector" list:
user_year = '1927'
mktrf_list = []
Loop through the FF_tiny.txt file, performing much the same operations as we did previously:
open file

loop through file line-by-line:
    slice out the year from the line
    if the sliced year is equal to user_year:
        split the line
        subscript the mktrf value (the 2nd value, or leftmost float value) from the list
        convert the value to float
        append the value to the collector list

print the resulting list
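
The outline above might translate into something like the following sketch. Because the exact layout of FF_tiny.txt is not shown here, the code reads from a few made-up rows held in memory (a date starting with the 4-digit year, followed by whitespace-separated floats); the numbers are placeholders, not values from the real file.

```python
import io

# Placeholder rows in an assumed layout; NOT data from FF_tiny.txt.
sample_file = io.StringIO(
    "19261230   0.10   0.20   0.30\n"
    "19270103   0.40   0.50   0.60\n"
    "19270104  -0.70   0.80   0.90\n"
    "19280103   1.10   1.20   1.30\n"
)

user_year = '1927'
mktrf_list = []                      # initialized once, before the loop

for line in sample_file:             # with the real file: for line in open('FF_tiny.txt')
    year = line[0:4]                 # slice out the year
    if year == user_year:
        items = line.split()         # split the line on whitespace
        mktrf = float(items[1])      # 2nd value, converted to float
        mktrf_list.append(mktrf)     # append inside the if block

print(mktrf_list)                    # printed once, after the loop
```
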

If you see the list printed multiple times, keep in mind that statements inside the loop run once for every line of the file, while statements outside the loop run only one time. If your output has only one item, ask the same kind of question: where am I initializing the list - inside or outside the loop? Where am I appending to the list - inside or outside the loop?

 
4.5 Calculate sum, count, max, min, average and (optionally) median from a list of values.

You'll do all of this using the 'summary' functions highlighted this week. Please don't loop. We summarized data using a counter and loop last week - this week we're highlighting the summary functions, sum(), len(), min() and max(). Use an f'' string to print these values in the string as specified.
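
As a rough illustration of the summary functions, here is a sketch that works on a short list of placeholder values. The variable names and the wording of the printed line are assumptions; please follow the output format given in the assignment. The median is found by sorting and taking the middle value, or the average of the two middle values when the count is even.

```python
values = [2.5, -1.0, 4.0, 0.5]       # placeholder values, not real data

total = sum(values)
count = len(values)
highest = max(values)
lowest = min(values)
average = total / count

# median: sort, then take the middle value (or average the two middle values)
ordered = sorted(values)
middle = count // 2
if count % 2 == 1:
    median = ordered[middle]
else:
    median = (ordered[middle - 1] + ordered[middle]) / 2

print(f'count {count}, sum {total}, max {highest}, min {lowest}, '
      f'average {average}, median {median}')
```
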

 
4.6 Combine the previous three assignments into one complete program, and then test that they work together with the FF_data.txt file.

Again, you must take care not to rewrite, combine or "nest" any of your code solutions within another - they are to follow one another consecutively, one after the other. The reason for this is simple - they are separate steps that don't logically depend on one another. However, you will need to remove hard-coded values and the 'input validated' message, and you may need to change variable names so that variables are shared between the steps, but you won't need to rewrite the steps themselves.

Special note on "no values found"

The one original addition here is the test that will announce "no values found for year YYYY" (where YYYY is the user's year). You will determine this by testing the length of the collector list. If after going through the entire file no year matched, the list will be empty. Test to see if the length of the list is 0 -- if so, print the message and exit.

In our data collection portion we noted that if the year wasn't in the data, we would end up with an empty list. Please use this as the indication of whether or not the year is correct:
if len(collected_values) == 0:
    print(f'no values found for year {user_year}')
    exit()

The above assumes that the name of your list variable is collected_values and the name of your user's input year is user_year. Please do not attempt to loop through the file to see if the year is in the file. Please use the list (or empty list) to determine if any values were collected.

 
4.7 Show unique years in FF_data.txt:

Reading from the dates in the left-hand column of FF_data.txt, compile a sorted list of unique 4-digit years. Use a set() to build the collection of years, then use sorted() to sort them into a list. Print the list and the number of years found in the list.

To initialize an empty set, you must use the set() function (empty curly braces signify a dict):
unique_years = set()

This solution requires the same algorithm that we used in the data collection in the previous assignment:

initialize an empty set

loop through the data
    slice the 4-digit year
    add the year to the set

sort the set into a list
get the len of the list
print the list
print the len of the list
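
A minimal sketch of this outline follows. It again reads a few made-up rows from memory rather than FF_data.txt, and assumes only that each line begins with the 4-digit year.

```python
import io

# Placeholder rows; only the leading 4-digit year matters for this step.
sample_file = io.StringIO(
    "19261230   0.10\n"
    "19270103   0.40\n"
    "19270104  -0.70\n"
    "19260701   1.10\n"
)

unique_years = set()                 # empty set, created before the loop

for line in sample_file:
    year = line[0:4]                 # slice the 4-digit year
    unique_years.add(year)           # duplicates are ignored by the set

year_list = sorted(unique_years)     # sorted() returns a list; done once, after the loop
year_count = len(year_list)

print(year_list)
print(year_count)
```
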

Don't forget that sorted() returns a list, so there is no need to explicitly convert to list. Make sure you aren't sorting the set or getting the len inside the loop - these operations are to be done one time, not many times, so we shouldn't be asking Python to do them more than once, even if the result is the same.

 

EXTRA CREDIT / SUPPLEMENTARY EXERCISES

 
4.8 (Extra credit / supplementary.) Please see assignment for discussion.
 
4.9 (Extra credit / supplementary.) wc emulation. The overall purpose of this assignment is to get you to think of a file as a whole string, a list of strings, and as a list of words. To determine the number of characters in the file, you need only think of the file as a string. To determine the number of lines in the file, you need only think of the file as a list of lines. To determine the number of words in the file, you need only think of the file as a list of words.

How can we think of the file as a string? file.read() returns the file contents as a string. We can use len() on a string to see how many characters are in the string.

How can we think of the file as a list of lines? str.splitlines() returns the string split on the newline character, i.e. a list of lines. We can use len() on a list to see how many elements are in the list.

How can we think of the file as a list of words? str.split() returns a string split on whitespace, i.e. a list of words. We can use len() on the list to see how many words are in the list.
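
The three views of a file can be tried on a short string first. In this sketch the variable text stands in for the result of file.read(); its contents are made up.

```python
text = "one two\nthree four five\n"   # stands in for the string returned by file.read()

char_count = len(text)                # the file as one string
line_count = len(text.splitlines())   # the file as a list of lines
word_count = len(text.split())        # the file as a list of words

print(line_count, word_count, char_count)
```
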

 
4.10 (Extra credit / supplementary.) Spell checker: check each word in a file against a set of spelling words.

Overview

The purpose of the assignment is to acquaint you with set membership tests (using in). This program is a practical solution implementing functionality familiar to all of us (i.e., most of us have seen a spell checker); it can be used on any file. Testing membership (using in) is another core analysis technique, and one of the primary uses of a set.

Load the spelling words into a set from the words.txt file (which contains a dictionary of 25,225 words, one per line). Then read the file sawyer.txt line-by-line, split each line into words, and loop through the words, printing all misspelled words (i.e., words in sawyer.txt not found in the set built up from words.txt) as well as the line number where each was found. You will have to rstrip() all punctuation (comma, semicolon, period) and lowercase each word you handle.

The spelling words should be loaded into a set, but the sawyer.txt words should not be loaded into a container structure: simply loop through the file line-by-line and, still inside the line-by-line for loop, split the line into words and loop through the list word-by-word (i.e., please don't use read().split() on sawyer.txt). Looping this way is also how we keep track of the line number. We can simply report each misspelled word; we do not need to add the word to another list or other container.

Step-by-step

We want to think of our program in terms of discrete steps.

1. Loop through the words.txt file line-by-line (word-by-word), strip each word of whitespace and load spelling words into a set.

Since the file contains only one word per line, a great one-liner for splitting the file into words (that also removes all newlines) is shown here. It produces a list, which can then be passed to set() to build the set:
words = open(filename).read().splitlines()

Test this step: print the whole set. This will deliver a huge output, but the end of the output should show you that each word has been stripped of newline (you'll see quotes around the word but no newline character).

2. Report the number of words in the set. Of course you know you can use len() to learn the size of a container.

Test this step: print the len() of the set: my count was 25225. If yours is different, ask whether you did any processing to each word, like lowercasing. I did not lowercase my spelling words before adding them to the set.

3. Report misspelled words: looping through the test file line-by-line, split each line into words and loop through the list of words; for each word in the split list, strip the word of newline and punctuation and lowercase the word; then check to see if it is in the set. If it is not, report the word and the line number on which it was found.

open the file
set line counter to 0
looping through the file line-by-line
    increment line counter
    split line into list of words
    loop through list word-by-word
        STRIP AND LOWERCASE THE WORD
        if the word is not in the set:
            print line number and word

This step has proved to be among the most challenging that we've seen so far, principally because of the 2-dimensional looping required. To get started, I recommend working with a smaller file. The output of a larger file can be confusing. Therefore, use pyku.txt. Its contents are easily recognizable (make sure to take a quick look at it before proceeding) and output will not get long and confusing.

A. Begin by looping through pyku.txt line-by-line. You should see each line printed separately. If you see a blank line between each printed line, the newline character is still there. Since we're planning to strip each word, this wouldn't be an issue, but for the purposes of clear testing results I suggest you strip each line as you loop.
We're out of gouda.
This parrot has ceased to be.
Spam, spam, spam, spam, spam.
B. Inside the loop, use the str method .split() to split the line on whitespace (no argument to split()), returning a list. Assign the list to a variable name and print each list.
["We're", 'out', 'of', 'gouda.']
['This', 'parrot', 'has', 'ceased', 'to', 'be.']
['Spam,', 'spam,', 'spam,', 'spam,', 'spam.']
C. Still inside the loop, instead of printing each split list, loop through and print each word.
We're
out
of
gouda.
This
parrot
has
ceased
to
be.
Spam,
spam,
spam,
spam,
spam.

D. Before printing each word, lowercase it and strip its trailing punctuation.

we're
out
of
gouda
this
parrot
has
ceased
to
be
spam
spam
spam
spam
spam

For stripping words, rstrip() is quite versatile: when used with no argument, it removes "whitespace" (spaces, tabs and newlines) from the right side of the word; when used with a string argument, removes any of a string of characters from the right side of the word.

Stripping punctuation does not need to be done in more than one call. For example, to strip either a question mark or an exclamation point from a string, we can use a single string with both characters:
aa = 'hello!'
bb = 'hello?'

print(aa.rstrip('!?'))   # hello
print(bb.rstrip('!?'))   # hello

5. Add a counter. Before the loop begins, set a variable called line_count to 0. As you are looping through the file line-by-line, increment this variable. As you loop through each word, print the line number on which that word was found.

Your print statement can be as simple as:
print(line_count, word)
1 we're
1 out
1 of
1 gouda
2 this
2 parrot
2 has
2 ceased
2 to
2 be
3 spam
3 spam
3 spam
3 spam
3 spam


6. Substitute sawyer.txt and see if the output makes sense. There will be a lot more output, but you should at least see that the first or last line has the right line number and all of the words from that line.

7. Complete the program by including the in test and report of misspelling. As you know, the way to test whether one word can be found in a set() of words is with in.

if word not in myset:
    print(f'misspelled word on line {line_count}:  {word}')
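
Putting the steps together on the three pyku.txt lines shown earlier might look like the sketch below. It is only an illustration under stated assumptions: the lines are held in memory rather than read from the file, the spelling set is a tiny made-up stand-in for words.txt, and the misspelled_count counter is an extra added here so the result is easy to check (the assignment only asks for the printed report).

```python
import io

# Tiny stand-in for the set built from words.txt (hypothetical contents).
spelling_words = {"we're", 'out', 'of', 'this', 'parrot', 'has', 'ceased', 'to', 'be'}

# The three pyku.txt lines from the walk-through, held in memory.
test_file = io.StringIO(
    "We're out of gouda.\n"
    "This parrot has ceased to be.\n"
    "Spam, spam, spam, spam, spam.\n"
)

line_count = 0
misspelled_count = 0                  # extra counter, only for checking this sketch

for line in test_file:                # outer loop: line-by-line
    line_count += 1
    for word in line.split():         # inner loop: word-by-word
        word = word.rstrip('.,;').lower()
        if word not in spelling_words:
            misspelled_count += 1
            print(f'misspelled word on line {line_count}:  {word}')
```

With this stand-in set, 'gouda' is reported on line 1 and 'spam' five times on line 3, because neither word was placed in the made-up set.
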

Good luck! Send questions!

 