Projects, Session 1

Advanced Python
Projects, Session 1

Please note: notations (as used in the Introduction to Python course) are no longer required.

SPECIAL NOTE ON OPENING FILES You may have noticed that the filenames we are using in the Inclass Exercises begin with a '../'. This specifies that the file can be found in the parent directory, or directory that is the parent or "one above" the folder our script is in. Any path information that precedes a filename (as long as it does not begin with C:/ or other drive letter (Windows) or / (Mac) is called a relative path. Not all relative paths will specify the parent directory as ../ does. To find a file using a relative path, we must know a) the location of the file, and b) the location from which we are running our script (the "present working directory"); from this we can determine c) the path needed to access the file location from the pwd. If Python can't find your file, it may be because the relative path is incorrect.

If the file you want to open is in the same directory as the script you're executing, use the filename alone:

fh = open('filename.txt')

If the file you want to open is in the parent directory from the script you're executing, use the filename with ../:

fh = open('../filename.txt')

If the file you want to open is in a child directory from the script you're executing, use the filename with the child directory name prepended:

fh = open('childdir/filename.txt')

(Above assumes the directory where the file is located is named childdir and that this directory is in the same directory as the script you're executing.)

1.1

Writing CSV rows with selected columns. Please create program select_columns.py that accepts two arguments: a source filename (for reading) and a target filename (for writing). The program selects the date, mean_temp, precip and events columns from the source file and writes them to the target file.

(this command is meant to be all on one line)

(base) david@192 ~ % python select_columns.py --source weather_newyork_t
iny.csv --target weather_newyork_narrow.csv

Please note that the text above up to and including the '%' is meant to represent your computer's Terminal or Command Prompt; yours will look different. Instead of a '%' you may see a '$' or a '>' just before where the command will be typed. Note that nargs= should not be used for --source or --target as only one argument is expected.

With the above arguments, weather_newyork_narrow.csv should have the following text (note that the events column value ('Rain', 'Fog-Rain', etc.) is sometimes missing in some rows):

date,mean_temp,precip,events
1/1/16,38,0,
2/1/16,52,0.01,Rain
3/1/16,46,0,
4/1/16,70,0.01,Rain
5/1/16,48,0.16,Rain
6/1/16,75,0,
7/1/16,73,0.83,Fog-Rain
8/1/16,75,T,Rain
9/1/16,74,0.5,Fog-Rain
10/1/16,59,0,
11/1/16,49,0,
12/1/16,48,0.07,Rain

Please note: opening a file for writing will truncate (or zero out) the file if it exists. This works fine for us as long as the source file and target file are different - the target file will be rewritten each time you run the program. Also please remember that you will not see writes to a file until you close the target file with wfh.close() or until the program ends execution. This is particularly critical with Jupyter as it is essentially a program that continues to run while the notebook is open. Lastly, if you are not sure where to start or you get stuck, the Homework Project Discussion document can provide additional guidance.
Extra challenge (not required): to create a reusable program -- to make the program reusable with any CSV file, add the --columns argument, which selects the columns read:

(this command is meant to be all on one line)

(base) david@192 ~ % python select_columns.py --source weather_newyork_t
iny.csv --target weather_newyork_narrow.csv --columns date mean_temp precip events

The above command should output the same resulting data as indicated above.

A second execution to the script (all one line) with a different file and columns will write only those columns:

(base) david@192 ~ % python select_columns.py --source ad_buys.csv --ta
rget ad_buys_narrow.csv --columns datetime buyer_id volume

With the above arguments, ad_buys_narrow.csv should have the following text:

datetime,buyer_id,volume
1/14/18 8:33,1,20
1/14/18 8:38,2,10
1/14/18 8:41,1,8
1/14/18 8:48,1,12
1/14/18 8:59,2,14
1/14/18 9:02,4,3
1/14/18 9:06,2,6
1/14/18 9:09,1,18
1/14/18 9:15,4,16
1/14/18 9:18,1,6
1/14/18 9:22,2,14
1/14/18 9:28,6,2
1/14/18 9:38,2,10
1/14/18 9:39,4,3
1/14/18 9:41,1,8

HOMEWORK CHECKLIST please use this checklist to avoid common mistakes and ensure a correct solution:

		the nargs argument is used only for those arguments that allow multiple arguments. If only one value is expected, do not use nargs, then you can read the value directly from the attribute, for example args.source
		the program uses the csv module to manage the commas, it does not use .split(',') or .splitlines() and it does not count commas or take any steps to parse the line outside of use of the csv module
		the program does not refer to any directory (e.g. ../ or other directory) -- if the file is in a different directory from the script, please include the relative path in the argument when calling the program, for example --source ../ad_buys.csv. In other words, the user of the program should be responsible for passing the correct path to the program, and the program should have no knowledge of where the file is located
		code conforms to points in the Code Quality pdf
		there are no extraneous comments or "testing" code lines
		program runs as shown in the assignment, or if it doesn't, a comment is placed at the top explaining what error or bad output has occurred (it is fine to turn in an incomplete solution if you have a question or would like to discuss ways to improve)

1.2

Writing CSV rows with selected rows based on row value. Please write a program select_rows.py that accepts four arguments: a source filename (for reading), a target filename (for writing), a column name (that will be evaluated to select rows) and a field value (the matching value).

(this command is meant to be all on one line.)

(base) david@192 ~ % python select_rows.py --source revenue.csv --ta
rget revenue_ny.csv --column state --value NY

Please note that the text above up to and including the '%' is meant to represent your computer's Terminal or Command Prompt; yours will look different. Instead of a '%' you may see a '$' or a '>' just before where the command will be typed. When complete, revenue_ny.csv should have the following text:

Expected Output:

company,state,price
Hipster's,NY,11.98
Dothraki Fashions,NY,5.98
The Clothiers,NY,115.2

A second call to the script (all one line) with a different file and arguments should write different rows:

(base) david@192 ~ % python select_rows.py --source ad_buys.csv --ta
rget ad_buys_selected.csv --column buyer_id --value 1

Please note that the text above up to and including the '%' is meant to represent your computer's Terminal or Command Prompt; yours will look different. Instead of a '%' you may see a '$' or a '>' just before where the command will be typed. With the above arguments, ad_buys_selected.csv should have the following text:

Expected Output:

datetime,buyer_id,seller_id,volume,price
1/14/18 8:33,1,2,20,0.083
1/14/18 8:41,1,2,8,0.061
1/14/18 8:48,1,4,12,0.13
1/14/18 9:09,1,4,18,0.14
1/14/18 9:18,1,2,6,0.043
1/14/18 9:41,1,2,8,0.061

Please note: opening a file for writing will truncate (or zero out) the file if it exists. This is ideal as long as the source file and the target file are different. Also please remember that you will not see writes to a file until you close the file with fh.close() or until the program ends execution. This is particularly critical with Jupyter as it is essentially a running program.
HOMEWORK CHECKLIST please use this checklist to avoid common mistakes and ensure a correct solution:

		the nargs argument is used only for those arguments that allow multiple arguments. If only one value is expected, do not use nargs, then you can read the value directly from the attribute, for example args.source
		code conforms to points in the Code Quality pdf
		there are no extraneous comments or "testing" code lines
		header is included in new file
		the comparison to determine if this is a 'NY' row is made to a single field, not to the entire row (i.e. using == and not in
		program runs as shown in the assignment, or if it doesn't, a comment is placed at the top explaining what error or bad output has occurred (it is fine to turn in an incomplete solution if you have a question or would like to discuss ways to improve)

EXTRA CREDIT

1.3

(extra credit / supplementary) Build a dict aggregation. Count the number of occurrences in each word of sonnet_xv.txt. The words must be lowercased and stripped of punctuation before being included in the dictionary. (The csv module is inappropriate for this solution.)

The string.punctuation module variable can be used to obtain a string of punctuation characters. Use it with the rstrip() method: word.rstrip(string.punctuation) In the below output, confirm that 'in' (the 9th word in the dict below) was found 5 times and 'new' (at the end of the dict) was found 1 time.

Expected Output:

{'when': 2, 'i': 3, 'consider': 1, 'every': 1, 'thing': 1, 'that': 3,
 'grows': 1, 'holds': 1, 'in': 5, 'perfection': 1, 'but': 2, 'a': 1,
 'little': 1, 'moment': 1, 'this': 2, 'huge': 1, 'stage': 1,
 'presenteth': 1, 'nought': 1, 'shows': 1, 'whereon': 1, 'the': 3,
 'stars': 1, 'secret': 1, 'influence': 1, 'comment': 1,
 'perceive': 1, 'men': 1, 'as': 2, 'plants': 1, 'increase': 1,
 'cheered': 1, 'and': 3, "cheque'd": 1, 'even': 1, 'by': 1,
 'self-same': 1, 'sky': 1, 'vaunt': 1, 'their': 2, 'youthful': 1,
 'sap': 1, 'at': 1, 'height': 1, 'decrease': 1, 'wear': 1, 'brave': 1,
 'state': 1, 'out': 1, 'of': 4, 'memory': 1, 'then': 1, 'conceit': 1,
 'inconstant': 1, 'stay': 1, 'sets': 1, 'you': 4, 'most': 1,
 'rich': 1, 'youth': 2, 'before': 1, 'my': 1, 'sight': 1, 'where': 1,
 'wasteful': 1, 'time': 2, 'debateth': 1, 'with': 2, 'decay': 1,
 'to': 2, 'change': 1, 'your': 1, 'day': 1, 'sullied': 1, 'night': 1,
 'all': 1, 'war': 1, 'for': 1, 'love': 1, 'he': 1, 'takes': 1,
 'from': 1, 'engraft': 1, 'new': 1}

After the dict is built and verified, sort the dict keys by value in reverse order. Show only those words that have a count of more than 1.

Expected Output:

in:  5
of:  4
you:  4
i:  3
that:  3
the:  3
and:  3
when:  2
but:  2
this:  2
as:  2
their:  2
youth:  2
time:  2
with:  2
to:  2

HOMEWORK CHECKLIST please use this checklist to avoid common mistakes and ensure a correct solution:

		program does not use the csv module
		code conforms to points in the Code Quality pdf
		there are no extraneous comments or "testing" code lines
		program runs as shown in the assignment, or if it doesn't, a comment is placed at the top explaining what error or bad output has occurred (it is fine to turn in an incomplete solution if you have a question or would like to discuss ways to improve)

1.4

(extra credit / supplementary) Split out file rows into multiple files. Reading the file we used for the csv module (revenue.csv), build a separate CSV file for each state in the source file (NY, NJ and PA), effectively selecting out the rows for each state into a separate file. At the end the program will have written three CSV files titled revenue_ny.csv, revenue_pa.csv and revenue_nj.csv with the header and rows for that state (as well as the header) written to each.

Avoid code repetition, in which the same operations are done multiple times. You should not have your code "do all these steps for NY" then "do all these steps for NJ", etc. Additional Requirements: to fully master data management, you must follow these additional restrictions:

do not loop through file more than once
do not refer to the states (NY, NJ, PA) inside your code
build the data into a single data structure, then write the structure to files

To accomplish the above we will use a dict of lists, which will be built up as it loops through the data, so any state encountered can be written to a file of that name.

When done, the dict of lists will look like this:

filelists = { 'NY': ['line1', 'line2', 'line3'...],
              'NJ': ['line1', 'line2', 'line3'...],
              'PA': ['line1', 'line2', 'line3'...]  }

It will not establish the dict with keys at the start. As mentioned, it will not refer to the states by name. Instead, it will loop through the data and when it encounters the first state name, will add a key to filelists with a list value, then append the first line to the list for that key. As it continues to encounter state names, it uses each name as key in filelists and appends to its list. Finally after the file parsing is complete, it will loop through the dict keys, open a csv file for writing based on the dict key name, and write the header and lines to the file. Challenging, but absolutely where you want to be going in terms of Python ETL!!

1.5

(extra credit / supplementary) Table aggregation and join. Build an aggregating dict to sum up volume by buyer_id in the ad_buys.csv table, and then join the result to the company_id in ad_companies.csv so that the company name is included in the result. Build a new CSV as shown:

Expected Output:

company_id,company_name,volume
1,Alpha Corp,72
2,Mike's Ads,54
4,Jones Kraft,22

[pr]