Projects, Session 4

Advanced Python

REGULAR EXPRESSIONS

SPECIAL NOTE: please do not use the vertical bar with whole patterns ("if it matches this whole pattern, or it matches this whole pattern, or it matches this whole pattern", etc.) as I will return such solutions. We must learn how to write regexes that match on variations without doing this - usually using quantifiers. ALSO: my results could be incorrect or incomplete! Please don't get hung up on matching my results precisely. If there are differences, let's discuss how to explore them. LASTLY: there are some good tips at the start of the project discussion.
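As a small illustration of the note above, here is a sketch that matches two spelling variations first with whole-pattern alternation and then with a quantifier. The sample strings are made up for this example and have nothing to do with the assignment data.

```python
import re

# Hypothetical sample strings; they are not taken from the assignment data.
samples = ["color", "colour", "colr"]

# Discouraged: one whole pattern per variation, joined with the vertical bar.
alternation = re.compile(r"^color$|^colour$")

# Preferred: a single pattern in which ? makes the "u" optional.
quantified = re.compile(r"^colou?r$")

alternation_hits = [s for s in samples if alternation.search(s)]
quantified_hits = [s for s in samples if quantified.search(s)]

print(alternation_hits)
print(quantified_hits)
```

Both patterns select the same strings, but the quantified version states the variation once, at the point where it occurs.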

4.1

Post a .py file to your github account.

Please review the slide deck titled "Getting Started with Github" that we used in class to set up our github and establish the ssh public/private keys. Scroll down to the sections titled "Perform Your First Commit and Push" and "The git commit cycle". Now that your git is set up and you have started a git repository linked to github, please now follow the remaining instructions to:

create a new or change an existing file
"add" the file or file changes to the git repository
commit the changes to your local repo, and then
push the changes to the github repo

Once pushed, you should be able to view the changes on github through the browser. For your submission for this assignment, please submit *just the link* to your github account. This link is the URL that you see at the top of the browser when you're viewing your repo. I'll use the link to visit your repo and see that you have made a change to it. You may also want to note the change that you made; otherwise, if your repo has a new file or the README has a notable change in it, I'll be able to see that you were successful. Please do not use the browser interface to make any of these changes to the repo. They must be made through the git commit cycle as described in the slide deck. Let me know your questions, thanks!

4.2

Build a dict from a web server log that sums up web server usage by user id.

An NYU user id consists of 2-3 letters and 1 or more numbers. My user id is dbb212.

On each line, access_log.txt contains a log line describing a web request -- here is an example:

172.26.93.208 - - [28/Jun/2012:21:00:45 -0400] "GET /~cmk380/pythondata/image3a.txt HTTP/1.1" 200 4487

cmk380: the home directory (also the user id) for the user hosting the requested resource
4487 (at the end of the line): the number of bytes downloaded in the request

Not all lines have an NYU ID or a bytes number. If a line is missing either (i.e. if either of the patterns did not match), skip it.

In addition, the pattern of 2-3 letters followed by one or more numbers may appear in other places, such as file or folder names. These are not user ids and should not be counted. To exclude them, and to earn credit for this assignment, include the "slash tilde" (/~) as part of the pattern so that only ids preceded by it are selected.

Use a summing dictionary to sum up the bytes found for each user id, sort the dict by value and show the users sorted from greatest to least bytes downloaded. List only those users that have > 10000000 bytes.
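The summing and sorting steps described above can be sketched on a few sample lines. This is only a sketch under stated assumptions: the log lines are invented in the style of the example, the name sum_bytes_by_user is hypothetical, letters are assumed to appear in either case, and the threshold is lowered so that the small sample prints something.

```python
import re

# Hypothetical log lines modeled on the example above; not real data.
sample_lines = [
    '172.26.93.208 - - [28/Jun/2012:21:00:45 -0400] "GET /~cmk380/pythondata/image3a.txt HTTP/1.1" 200 4487',
    '172.26.93.208 - - [28/Jun/2012:21:01:02 -0400] "GET /~cmk380/ab12/notes.txt HTTP/1.1" 200 513',
    '172.26.93.208 - - [28/Jun/2012:21:02:10 -0400] "GET /~dbb212/index.html HTTP/1.1" 200 9000',
    '172.26.93.208 - - [28/Jun/2012:21:03:33 -0400] "GET /~dbb212/missing.html HTTP/1.1" 404 -',
    '172.26.93.208 - - [28/Jun/2012:21:04:05 -0400] "GET /favicon.ico HTTP/1.1" 200 318',
]

# The group captures only the id, so no slicing is needed afterwards.
id_pattern = re.compile(r"/~([A-Za-z]{2,3}\d+)")
# Digits at the very end of the line are the byte count.
bytes_pattern = re.compile(r" (\d+)$")


def sum_bytes_by_user(lines):
    """Return (number of matching lines, dict of user id -> total bytes)."""
    totals = {}
    matched = 0
    for line in lines:
        id_match = id_pattern.search(line)
        bytes_match = bytes_pattern.search(line)
        if not (id_match and bytes_match):
            continue  # skip lines missing either value
        matched += 1
        user = id_match.group(1)
        totals[user] = totals.get(user, 0) + int(bytes_match.group(1))
    return matched, totals


matched, totals = sum_bytes_by_user(sample_lines)
threshold = 1000  # assumption for the sample; the assignment uses 10000000

print(f"{matched} matches found")
for user in sorted(totals, key=totals.get, reverse=True):
    if totals[user] > threshold:
        print(f"{user}:  {totals[user]}")
```

Note that the folder name ab12 in the second sample line has the shape of a user id but is not counted, because it is not preceded by the slash tilde.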

Expected Output:

2235 matches found (both user id and end-of-line bytes found on the line)

shl249:  180275381
mm64:  92987649
jl2462:  60035526
myd212:  30060623
mrh382:  21893612
dbb212:  18108625
gb1067:  15346867
bnb224:  11175954
mhy229:  10937128

Extra Credit: find more references to nyu ids -- see below.

HOMEWORK CHECKLIST please use this checklist to avoid common mistakes and ensure a correct solution:

		you do no post-processing -- that is splitting, slicing or otherwise modifying the extracted values after using the regex. Instead, use the proper grouping so that the value extracted is in final form.
		you do not use the 'wildcard' (.) -- this is a special restriction for these assignments only
		you use the +, * and ? (the built-in quantifiers) and not {1,}, {0,} or {0,1} (their custom quantifier equivalents) (This doesn't mean you shouldn't use custom quantifiers like {2,3}; it means that you should use + instead of {1,}, * instead of {0,} and ? instead of {0,1})
		code conforms to points in the Code Quality pdf
		there are no extraneous comments or "testing" code lines
		program runs as shown in the assignment, or if it doesn't, a comment is placed at the top explaining what error or bad output has occurred (it is fine to turn in an incomplete solution if you have a question or would like to discuss ways to improve)
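
Two of the checklist points can be illustrated with a short sketch. The sample text and the "id=" prefix are invented for this example and are not part of the assignment data.

```python
import re

# Hypothetical sample text, not from the assignment files.
text = "id=42; id=7; id=; id=1234"

# Each pair holds a built-in quantifier and its custom equivalent.
pairs = [
    (r"id=(\d+)", r"id=(\d{1,})"),   # one or more
    (r"id=(\d*)", r"id=(\d{0,})"),   # zero or more
    (r"id=(\d?)", r"id=(\d{0,1})"),  # zero or one
]
same_results = [
    re.findall(built_in, text) == re.findall(custom, text)
    for built_in, custom in pairs
]
print(same_results)

# Post-processing (discouraged): match too much, then slice off the prefix.
sliced = re.search(r"id=\d+", text).group()[3:]

# Grouping (preferred): the parentheses capture the value in final form.
grouped = re.search(r"id=(\d+)", text).group(1)

print(sliced, grouped)
```

The built-in and custom quantifiers return identical matches, so the shorter form is preferred. Slicing and grouping also give the same value here, but only the grouped version keeps all of the extraction logic inside the pattern.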

4.3

Extract all mentions of stock symbols and price changes from a news article.

The news article in market_discussion.txt contains a number of references to stock symbols and pct change for the stock at that moment:

... according to The Wall Street Journal, SoftBank Group SFTBY, +0.65% didn’t have ...

The ticker will be one or more capital letters followed by a comma, and the price change will start with a sign (+ or -), have at least one digit after the sign, and have two decimal places after the period, followed by a percent sign. Please do not split the text or loop through its lines or words. Use fh.read() to read the file into a single string, then re.findall(), which will search through the entire text and return a list of strings with all matches. Then, please sort the resulting list, and loop through it to print each item.
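
To show how re.findall() behaves on one large string, here is a minimal sketch. The text, tickers and company names are invented, and a string literal stands in for the result of fh.read(). The pattern is one possible reading of the description above, not necessarily the intended solution.

```python
import re

# Hypothetical text in the style of the article; not the real file contents.
text = (
    "Shares of Example Corp XMPL, +1.25% rose, while Sample Inc SMP, -0.37% "
    "slipped and the broad index ABCD, +10.04% rallied."
)

# With a single group, findall returns a list of the grouped strings,
# so the percent sign is required by the pattern but left out of the result.
matches = re.findall(r"([A-Z]+, [+-]\d+\.\d\d)%", text)

for item in sorted(matches):
    print(item)
```

Capital letters at the start of ordinary words (such as the "S" in "Shares") are not returned, because the pattern requires a comma immediately after the run of capitals.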

Expected Output:

COMP, -0.44
CVS, +5.02
DJIA, -0.15
HPQ, +10.54
HUM, +3.42
SFTBY, +0.65
SPX, -0.12
TMUBMUSD, -2.52
XRX, +2.67

Extra Challenge: collect the results in a dict and sort by value, highest to lowest. To do this, you'll use .findall() with groups (grouping the ticker and the price), which returns a list of tuples. Again, please do not split the file or loop through lines or words. You must use .findall().
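
The behavior of .findall() with two groups can be sketched as follows. The text and tickers are invented for illustration, and converting each price to float for sorting is an assumption about how the dict values would be compared.

```python
import re

# Hypothetical text in the style of the article; not the real file contents.
text = (
    "Shares of Example Corp XMPL, +1.25% rose, while Sample Inc SMP, -0.37% "
    "slipped and the broad index ABCD, +10.04% rallied."
)

# With two groups, findall returns a list of (ticker, change) tuples.
pairs = re.findall(r"([A-Z]+), ([+-]\d+\.\d\d)%", text)

# Assumption: store each change as a float so the sort is numeric.
changes = {ticker: float(change) for ticker, change in pairs}

# Sort the keys by their values, highest to lowest.
ranked = sorted(changes, key=changes.get, reverse=True)
for ticker in ranked:
    print(f"{ticker}, {changes[ticker]}")
```

Sorting the strings themselves would compare them character by character, which is why the sketch converts to float before ranking.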

Extra challenge expected output:

HPQ, 10.54
CVS, 5.02
HUM, 3.42
XRX, 2.67
SFTBY, 0.65
SPX, -0.12
DJIA, -0.15
COMP, -0.44
TMUBMUSD, -2.52

HOMEWORK CHECKLIST please use this checklist to avoid common mistakes and ensure a correct solution:

		you do no post-processing -- that is splitting, slicing or otherwise modifying the extracted values after using the regex. Instead, use the proper grouping so that the value extracted is in final form.
		you do not use the 'wildcard' (.) -- this is a special restriction for these assignments only
		you use the +, * and ? (the built-in quantifiers) and not {1,}, {0,} or {0,1} (their custom quantifier equivalents)
		code conforms to points in the Code Quality pdf
		there are no extraneous comments or "testing" code lines
		program runs as shown in the assignment, or if it doesn't, a comment is placed at the top explaining what error or bad output has occurred (it is fine to turn in an incomplete solution if you have a question or would like to discuss ways to improve)

4.4

(extra credit) Find user ids in the web server access log that do not conform to the 'slash-tilde' pattern. By my count, there are 207 additional lines that feature user ids but don't match the '/~' pattern. Find these additional lines and include them in the sum. This one is tricky, so see the discussion for hints, and a narrative of how I arrived at my results.

Expected Output:

2442 matches found (user id and end-of-line bytes)

shl249:  180275381
mm64:  93688659
jl2462:  60035811
myd212:  30060623
mrh382:  21893612
dbb212:  18707260
gb1067:  15346867
bnb224:  11175954
mhy229:  10937128

(Note that the count as well as values for mm64 and dbb212 are different from the previous output.) Remember, if your results are different, don't spend time trying to conform - instead, investigate and/or let's discuss.
