Advanced Python
Projects, Session 4
REGULAR EXPRESSIONS

SPECIAL NOTE: please do not use the vertical bar to join whole patterns ("if it matches this whole pattern, or it matches this whole pattern, or it matches this whole pattern", etc.), as I will return such solutions. We must learn to write regexes that match variations without doing this, usually by using quantifiers.

ALSO: my results could be incorrect or incomplete! Please don't get hung up on matching my results precisely. If there are differences, let's discuss how to explore them.

LASTLY: there are some good tips at the start of the project discussion.
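As a small illustration of the note about the vertical bar (not taken from the assignment; the sample text is made up), the optional quantifier ? lets a single pattern cover both spellings "color" and "colour" without joining two whole patterns:

```python
import re

# Hypothetical sample text, invented for this sketch.
text = "The color red and the colour blue."

# Discouraged: two whole patterns joined with the vertical bar.
with_bar = re.findall(r"color|colour", text)

# Preferred: one pattern; the ? quantifier makes the "u" optional.
with_quantifier = re.findall(r"colou?r", text)

print(with_bar)
print(with_quantifier)
```

Both calls find the same two words; the second states the variation once instead of repeating the whole pattern.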
4.1 Post a .py file to your GitHub account.
Please review the slide deck titled "Getting Started with Github" that we used in class to set up GitHub and establish the ssh public/private keys. Scroll down to the sections titled "Perform Your First Commit and Push" and "The git commit cycle". Now that git is set up and you have started a git repository linked to GitHub, please follow the remaining instructions in those sections to make a commit and push it to GitHub.
Once pushed, you should be able to view the changes on GitHub through the browser. For this assignment, please submit *just the link* to your GitHub account. This link is the URL that you see at the top of the browser when you're viewing your repo. I'll use the link to visit your repo and see that you have made a change to it. You may also want to note the change that you made; otherwise, if your repo has a new file or the README has a notable change in it, I'll be able to see that you were successful. Please do not use the browser interface to make any of these changes to the repo. They must be made through the git commit cycle as described in the slide deck. Let me know your questions, thanks!
4.2 Build a dict from a web server log that sums up web server usage by user id.
An NYU user id consists of 2-3 letters followed by 1 or more digits. My user id is dbb212.
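A minimal sketch of that description as a pattern, assuming ids use lowercase letters (as in dbb212). Every candidate string other than dbb212 is made up for illustration:

```python
import re

# 2-3 lowercase letters followed by one or more digits.
# Lowercase-only is an assumption based on the example id.
user_id_re = re.compile(r"[a-z]{2,3}\d+")

# dbb212 comes from the assignment; the other candidates are hypothetical.
candidates = ["dbb212", "ab1", "a123", "abcd12", "abc"]

for candidate in candidates:
    # fullmatch requires the whole string to fit the pattern.
    is_id = user_id_re.fullmatch(candidate) is not None
    print(candidate, is_id)
```

In a log line the id is embedded in longer text, so search() rather than fullmatch() would be the natural call there.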
Each line of access_log.txt is a log line describing a web request. Here is an example:
172.26.93.208 - - [28/Jun/2012:21:00:45 -0400] "GET /~cmk380/pythondata/image3a.txt HTTP/1.1" 200 4487
cmk380: the home directory (also the user id) of the user hosting the requested resource
4487 (at the end of the line): the number of bytes downloaded in the request

Not all lines have an NYU ID or a bytes number. If a line is missing either (i.e. if either of the patterns did not match), skip it.

In addition, the pattern of 2-3 letters followed by one or more digits may appear in other places, such as file or folder names. These are not user ids and should not be counted. To exclude them, for the credit portion of this assignment include the "slash tilde" (/~) as part of the pattern so that only ids preceded by it are selected.

Use a summing dictionary to sum up the bytes found for each user id, sort the dict by value, and show the users sorted from greatest to least bytes downloaded. List only those users that have > 10000000 bytes.
Expected Output:
2235 matches found (both user id and end-of-line bytes found on the line)
shl249: 180275381
mm64: 92987649
jl2462: 60035526
myd212: 30060623
mrh382: 21893612
dbb212: 18108625
gb1067: 15346867
bnb224: 11175954
mhy229: 10937128
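The steps for 4.2 can be sketched on a few in-memory lines rather than the real file. This is one possible approach under stated assumptions, not the reference solution: only the first log line comes from the assignment, the other three are invented, and the threshold is lowered so the tiny sample prints something.

```python
import re

# First line is the assignment's example; the rest are hypothetical.
log_lines = [
    '172.26.93.208 - - [28/Jun/2012:21:00:45 -0400] "GET /~cmk380/pythondata/image3a.txt HTTP/1.1" 200 4487',
    '172.26.93.208 - - [28/Jun/2012:21:01:10 -0400] "GET /~cmk380/pythondata/ab12.txt HTTP/1.1" 200 513',
    '172.26.93.208 - - [28/Jun/2012:21:02:00 -0400] "GET /~dbb212/index.html HTTP/1.1" 304 -',
    '172.26.93.208 - - [28/Jun/2012:21:03:00 -0400] "GET /favicon.ico HTTP/1.1" 200 1406',
]

# "/~" ties the id to a home directory; the group captures only the id.
user_id_re = re.compile(r"/~([a-z]{2,3}\d+)")
# One or more digits at the very end of the line.
bytes_re = re.compile(r"(\d+)$")

bytes_by_user = {}
match_count = 0
for line in log_lines:
    id_match = user_id_re.search(line)
    bytes_match = bytes_re.search(line.rstrip())
    if not (id_match and bytes_match):
        continue  # skip lines missing either the id or the byte count
    match_count += 1
    user_id = id_match.group(1)
    # Summing dictionary: start from 0 the first time an id is seen.
    bytes_by_user[user_id] = bytes_by_user.get(user_id, 0) + int(bytes_match.group(1))

threshold = 1000  # hypothetical; the assignment uses 10000000
print(f"{match_count} matches found")
for user_id in sorted(bytes_by_user, key=bytes_by_user.get, reverse=True):
    if bytes_by_user[user_id] > threshold:
        print(f"{user_id}: {bytes_by_user[user_id]}")
```

The second sample line contains ab12 in a file name; it is ignored because it is not preceded by "/~". The third line ends in "-" rather than a byte count, and the fourth has no home directory, so both are skipped.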
4.3 Extract all mentions of stock symbols and price changes from a news article.
The news article in market_discussion.txt contains a number of references to stock symbols, each with the pct change for the stock at that moment:
... according to The Wall Street Journal, SoftBank Group SFTBY, +0.65% didn’t have ...
The ticker will be one or more capital letters followed by a comma, and the price change will start with a sign (+ or -), have at least one digit after the sign, and have two decimal places after the period, followed by a percent sign.

Please do not split the text or loop through it to find matches. Use fh.read() to read the file into a single string, then re.findall(), which will search through the entire text and return a list of strings with all matches. Then sort the resulting list, and loop through it to print each item.
Expected Output:
COMP, -0.44
CVS, +5.02
DJIA, -0.15
HPQ, +10.54
HUM, +3.42
SFTBY, +0.65
SPX, -0.12
TMUBMUSD, -2.52
XRX, +2.67
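A hedged sketch of the findall() approach on an in-memory string instead of the file. The first quoted fragment is the assignment's example; the DJIA fragment is invented around a value from the expected output. The pattern is one possible reading of the description, with a single group so that findall() returns the text without the percent sign:

```python
import re

# Stand-in for fh.read(); the second fragment is hypothetical wording.
text = (
    "... according to The Wall Street Journal, SoftBank Group SFTBY, +0.65% "
    "didn't have ... while the Dow DJIA, -0.15% slipped ..."
)

# Capital letters, comma, space, sign, digits, period, two digits, percent.
# With exactly one group, findall() returns that group's text per match.
mentions = re.findall(r"([A-Z]+, [+-]\d+\.\d{2})%", text)

# The only loop is over the finished list of matches, for printing.
for mention in sorted(mentions):
    print(mention)
```

Capitalized words such as "Journal" or "Dow" are not picked up because only their first letter is a capital, and that letter is not followed by a comma.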
Extra Challenge: collect the results in a dict and sort by value, highest to lowest. To do this, use .findall() with two groups (one for the ticker and one for the price change), which returns a list of tuples. Again, please do not split the file or loop through lines or words. You must use .findall().
Extra challenge expected output:
HPQ, 10.54
CVS, 5.02
HUM, 3.42
XRX, 2.67
SFTBY, 0.65
SPX, -0.12
DJIA, -0.15
COMP, -0.44
TMUBMUSD, -2.52
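The extra challenge can be sketched the same way, again on a made-up string (only the SFTBY fragment is from the assignment). With two groups, findall() returns (ticker, change) tuples; converting the change to float is an assumption made here so that the sort is numeric rather than alphabetical:

```python
import re

# Stand-in for fh.read(); fragments after the first are hypothetical wording.
text = (
    "... according to The Wall Street Journal, SoftBank Group SFTBY, +0.65% "
    "didn't have ... while the Dow DJIA, -0.15% slipped and HP HPQ, +10.54% rose ..."
)

# Two groups: findall() returns a list of (ticker, change) tuples.
pairs = re.findall(r"([A-Z]+), ([+-]\d+\.\d{2})%", text)

# float() accepts a leading sign, so "+0.65" becomes 0.65.
change_by_ticker = {ticker: float(change) for ticker, change in pairs}

# Sort the tickers by their numeric change, highest first.
ranked = sorted(change_by_ticker, key=change_by_ticker.get, reverse=True)
for ticker in ranked:
    print(f"{ticker}, {change_by_ticker[ticker]}")
```

Sorting the original strings instead would put "+10.54" before "+0.65" only by accident of character order, which is why the values are converted first.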
HOMEWORK CHECKLIST
Please use this checklist to avoid common mistakes and ensure a correct solution.
4.4 (extra credit) Find user ids in the web server access log that do not conform to the 'slash-tilde' pattern.
By my count, there are 207 additional lines that feature user ids but don't match the '/~' pattern. Find these additional lines and include them in the sum. This one is tricky, so see the discussion for hints and a narrative of how I arrived at my results.
Expected Output:
2442 matches found (user id and end-of-line bytes)
shl249: 180275381
mm64: 93688659
jl2462: 60035811
myd212: 30060623
mrh382: 21893612
dbb212: 18707260
gb1067: 15346867
bnb224: 11175954
mhy229: 10937128
(Note that the count, as well as the values for mm64, jl2462 and dbb212, differ from the previous output.) Remember, if your results are different, don't spend time trying to conform. Instead, investigate and/or let's discuss.