Advanced Python
Project Discussion, Session 4
GIT AND GITHUB |
|
4.1 | Post a .py file to your GitHub account. |
The instructions for setting up a repo and adding/committing/pushing a file are in the slides entitled "Getting Started with Github". Please use those instructions to complete this homework, and send questions. Please keep in mind that this task is relatively new to the course, so there may be errors or omissions in the instructions - questions and comments/observations are very much welcome. |
|
REGULAR EXPRESSIONS |
|
SPECIAL NOTE: please do not use the vertical bar with whole patterns ("if it matches this whole pattern, or it matches this whole pattern, or it matches this whole pattern", etc.) as I will return such solutions. We must learn how to write regexes that match on variations without doing this - usually using quantifiers. ALSO: my results could be incorrect or incomplete! Please don't get hung up on matching my results precisely. If there are differences, let's discuss how to explore them. |
|
GENERAL TIPS: Start with small test programs and small files. This week's datasets are large, so test your regexes against sample data first. Make a small program with one line from the file, and work with it until you can get your pattern to match. Then move on to the larger data and larger program. Regexes never tell you where they didn't match: if a regex doesn't match, you get no feedback about what's wrong with the pattern. Rather than take guesses, I suggest you remove characters from the pattern until it does match, then gradually build it back up, testing it as you go. |
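For example, a tiny test program along these lines lets you grow the pattern one piece at a time (the sample log line and the patterns shown are only illustrative guesses, not the actual data):

import re

# One hypothetical log line to develop against (the real format may differ).
line = '128.122.1.1 - - [10/Oct/2023:13:55:36 -0400] "GET /~dbb212/index.html HTTP/1.1" 200 5120'

# Build the pattern up in stages, testing each one, so a failure
# tells you exactly which addition broke the match.
for pattern in (r'/~', r'/~[a-z]+', r'/~[a-z]+\d+'):
    matchobj = re.search(pattern, line)
    print(pattern, '->', matchobj.group(0) if matchobj else 'NO MATCH')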
|
4.2 | Build a dict from a web server log that sums up web server usage by user id. |
I found it easiest to use two regexes: one to match on the home directory (e.g. ~cmk380 or ~dbb212) and one to match on the bytes number (the last integer on each line). It is possible to grab these in one pattern, but not necessary. It is a best practice to be as specific as possible when trying to match, so as to avoid false positives. Therefore please make sure to specify as many identifying characters as possible. /~ (slash, tilde) at the start of the pattern followed by an NYU ID pattern (e.g. dbb212 or mm64) is probably enough to positively identify the home directory name. |
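As a rough illustration only (the exact character classes are an assumption about the id format and the log layout, so verify them against the real lines), the two searches might use patterns something like these:

homedir_pat = r'/~([a-z]{2,3}\d+)'    # '/~' followed by 2-3 letters and some digits; group(1) captures the id
bytes_pat = r'(\d+)\s*$'              # group(1) captures the last run of digits before the end of the line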
|
However, as noted in the assignment, there are a number of lines in the file that will not match one or both regexes. Python will raise an exception if you attempt to call .group() on a search that didn't match. Therefore you should assign re.search() to a variable and test the variable for truth in an if expression before attempting to call .group() on it:
for line in file:
    matchobj1 = re.search(<homedir pattern>, line)
    matchobj2 = re.search(<bytes pattern>, line)
    if not matchobj1 or not matchobj2:
        continue                      # skip lines that didn't match both patterns
    id = matchobj1.group(1)
    bytes = matchobj2.group(1)        # or group(0) if you're pulling the entire match
|
|
It can be extremely helpful to explore the data by seeing what your pattern didn't match. If a line didn't match either pattern, you can also print the line. |
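A quick way to do that (using the illustrative patterns from above and a hypothetical filename):

import re

homedir_pat = r'/~([a-z]{2,3}\d+)'
bytes_pat = r'(\d+)\s*$'

with open('access.log') as logfile:               # hypothetical filename
    for line in logfile:
        if not re.search(homedir_pat, line) or not re.search(bytes_pat, line):
            print('NO MATCH:', line, end='')      # eyeball these to see what you're missing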
|
The summing dictionary is something we explored in the Introduction to Python course. Here is some pseudocode mixing plain-English steps with actual code:
bytes_count = {}
for line in file
    search for nyu id
    extract nyu id from line (assign to nyu_id)
    search for bytes
    extract bytes from line (assign to bytes, converted to int)
    if nyu_id not in bytes_count:
        bytes_count[nyu_id] = 0
    bytes_count[nyu_id] = bytes_count[nyu_id] + bytes |
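Putting the pieces together, a minimal runnable sketch might look like this (the filename and both patterns are assumptions about the data; note the int() conversion, since .group() always returns a string):

import re

homedir_pat = r'/~([a-z]{2,3}\d+)'   # group(1): the NYU id (assumed format)
bytes_pat = r'(\d+)\s*$'             # group(1): the last integer on the line

bytes_count = {}
with open('access.log') as logfile:              # hypothetical filename
    for line in logfile:
        matchobj1 = re.search(homedir_pat, line)
        matchobj2 = re.search(bytes_pat, line)
        if not matchobj1 or not matchobj2:
            continue
        nyu_id = matchobj1.group(1)
        num_bytes = int(matchobj2.group(1))      # convert the string to an int before summing
        if nyu_id not in bytes_count:
            bytes_count[nyu_id] = 0
        bytes_count[nyu_id] = bytes_count[nyu_id] + num_bytes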
|
We sort a dictionary by value by passing its .get method as the key= argument to sorted():
for key in sorted(bytes_count, key=bytes_count.get):
print(f'{key}: {bytes_count[key]}')
|
|
|
|
4.3 | Extract all mentions of stock symbols and price changes from a news article. |
I found this one to be pretty simple as long as you don't use any parenthetical groupings (for the extra challenge, you will need them). To match on a + or - without using parentheses, use the custom character class [+-] to specify either a + or a -. Without groups, .findall() returns a list of the matched text from the search string; passing this list to sorted() will sort the matches by ticker. For the extra challenge, you must group the ticker and the price change so you can isolate the change for sorting. .findall() with 2 groups returns a list of 2-item tuples. Loop through these, convert the pct change to a float (the sign (+/-) can be included in float()), and add each pair as key/value to a dict. Sort the dict by value and use reverse=True to sort highest to lowest. |
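A minimal sketch of both versions, assuming mentions in the article look something like 'AAPL +1.25%' (the sample text and the exact ticker/percent format are assumptions; adjust the pattern to the real article):

import re

# Hypothetical article text (the real article's format may differ).
article = 'Shares of AAPL +1.25% rallied while MSFT -0.40% and TSLA +3.10% were mixed.'

# Basic version: no groups, so .findall() returns the matched text itself.
print(sorted(re.findall(r'[A-Z]{1,5} [+-]\d+\.\d+%', article)))   # sorted by ticker

# Extra challenge: two groups, so .findall() returns (ticker, change) tuples.
changes = {}
for ticker, pct in re.findall(r'([A-Z]{1,5}) ([+-]\d+\.\d+)%', article):
    changes[ticker] = float(pct)               # float() accepts the leading + or -

for ticker in sorted(changes, key=changes.get, reverse=True):     # highest to lowest
    print(f'{ticker}: {changes[ticker]}')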
|
4.4 | (Extra credit) Find user ids in the web server access log that do not conform to the 'slash-tilde' pattern. |
For the extra credit, your job is to identify patterns that you aren't expecting - this can be done through a combination of programmatic searching and visual appraisal (actually eyeballing the results). The basic idea is that there are lines that match the user id pattern (2-3 letters followed by one or more numbers) but don't match the slash-tilde pattern (e.g. '/~dbb212'). Of these, some are definitely user ids, but others might not be (because the user id letter-number pattern could appear in things like filenames or folder names). Consider this workflow:
|
|
The main issue in identifying all lines that have a user id is that the 'letter-number' user id pattern (e.g. 'dbb212') could appear in many other places, such as a filename or folder name, without actually being a user id. However, it is possible (and appears to be the case) that user ids can be identified by other characters before or after them, just as the 'slash-tilde' pattern (e.g. '/~dbb212') identifies the 2235 matches mentioned in the main part of the assignment. So our goal is to understand how else a user id can be represented, and how we can programmatically identify those representations, without matching strings that merely look like user ids but clearly aren't. That means doing some 'witnessing' of examples of the lines that match, so we can discover additional patterns that positively identify user ids rather than relying on the bare letter-number pattern alone.

The first thing I did was print out all lines that match the simple pattern (two or more letters followed by one or more numbers, e.g. 'dbb212') but don't match the slash-tilde pattern (e.g. '/~dbb212'). A few hundred lines were printed, and scanning these visually I saw another pattern that seemed to positively identify a user id; it appeared on exactly 200 lines (I had my program count these; I didn't attempt to count them all visually!). (Actually I first identified a more specific pattern that matched 196 lines, but then realized there is a more general pattern that picks up an additional 4.) Then, printing only those lines that have the 'letter-number' pattern but don't match the newly identified pattern, I found one other pattern that I felt could positively identify a user id; my program counted 8 of these.

Lastly, I found 46 lines that had the letter-number pattern but did not appear to contain a user id at all. These were lines with folder names or filenames that just happened to match the user id pattern. I visually scanned each of these to see if there were any other patterns, but could not identify any. (With a much bigger file we wouldn't have the luxury of scanning every result visually; we'd have to settle for a spot check.)

So in the end, I found 2442 lines that have user ids (2235 from the slash-tilde pattern + 200 from the 2nd discovered pattern + 8 from the 3rd discovered pattern). (If you add those counts you'll arrive at 2443, not 2442. This discrepancy was a huge annoyance to me until I realized that two of the patterns actually appear on the same line! So the true count of lines with user ids, by my analysis and not necessarily correct, is 2442.) I can share the other patterns I discovered upon request. Remember, my results could be wrong. I know they're either right or very close, but in the real world you can never be 100% certain! Let me know if your results differ. |
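As a starting point, here is a sketch of the first filtering step described above (the two patterns and the filename are assumptions; in particular the exact letter counts in the id pattern are a guess):

import re

simple_id_pat = r'[a-z]{2,}\d+'        # bare letter-number pattern (assumed format)
slash_tilde_pat = r'/~[a-z]{2,}\d+'    # the pattern from the main assignment

count = 0
with open('access.log') as logfile:    # hypothetical filename
    for line in logfile:
        if re.search(simple_id_pat, line) and not re.search(slash_tilde_pat, line):
            count += 1
            print(line, end='')        # eyeball these lines to discover additional id patterns

print(f'{count} candidate lines to inspect')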
|