Advanced Python
Projects, Session 3
Please note: notations (as used in the Introduction to Python course) are no longer required.
Ex. 3.1 | Scrape an Article from the New York Times.
In this assignment you will select an article from the New York Times (or choose the one linked below), use requests to download the article's HTML page source, and then scrape relevant information from the article: the title, the byline (author), the date, and the text of the article's paragraphs.
If you can't find a meta tag that has the info (title, author, date) you want, you may use other tags, but please look carefully for it. Please do not use the content of a tag to find the tag. The assumption is that you don't know the content and are trying to extract it -- you only know the tag name and its attributes.
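For orientation, here is a minimal sketch of looking up a meta tag with BeautifulSoup. The 'og:title' property and the sample HTML are purely illustrative -- inspect the actual page source to find the tags that carry the title, author and date:

from bs4 import BeautifulSoup

# page_source stands in for the HTML string you download with requests
page_source = '<html><head><meta property="og:title" content="Sample Title"></head></html>'
soup = BeautifulSoup(page_source, 'html.parser')

# find the tag by its name and attributes, never by its content
title_tag = soup.find('meta', attrs={'property': 'og:title'})
if title_tag is not None:
    print('Title:', title_tag.get('content'))   # meta tags carry their value in 'content'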
The output of the program should be the text elements printed in a readable format, for example:
Title: Elk Return to Kentucky, Bringing Economic Life
Byline: By Oliver Whang and Morgan Hornsby
Date: 2020-06-30

On a bright morning early this spring, David Ledford sat in his silver pickup at the end of a three-lane bridge spanning a deep gorge in southeast Kentucky. The bridge, which forks off U.S. 119, was constructed in 1998 by former Gov. Paul E. Patton for $6 million. It was seen at the time as a route to many things: a highway, a strip mall, housing...

[article continues... please output entire article]

31 text paragraphs total
(Please note that the article is abbreviated above just to save space -- your program should print out the entire article.) Also note that the article paragraphs are in <p> tags, but not every such tag holds an article paragraph -- you must be selective, not just pull all <p> tags.

Sample article link -- use this or find one that interests you:

https://www.nytimes.com/2020/06/30/science/kentucky-elk-wildlife-coal.html

Please keep in mind that although the NYTimes may display a "please register" banner when you view an article, you may still be able to view, download and save the source and its content. If you're unable to view the article in a browser, you can find a viewable copy of the above article here:

http://davidbpython.com/classfiles/elk.html

If you run into the NYTimes free article limit, you can use a different browser, clear your cookies, or use requests to download an article and save it to your hard drive, opening it from your browser locally using File > Open File....

If your requests.get() call is not returning the article page but what looks like a different HTML page, please look at the text of the response closely. You may see that the full text of your response is relatively brief and includes text such as "Please enable javascript". This means that the Times server has detected that you are not actually a browser and has decided to serve you bupkus (i.e., nothing) for your troubles. (They do this because, of course, they want their articles consumed in a browser, where images and ads can be displayed.) To resolve this issue, you can add a "spoof header" to requests.get() which will tell the NY Times server that you're a browser and thus should not be subject to the above restriction:
import requests

url = 'https://www.nytimes.com/2020/06/30/science/kentucky-elk-wildlife-coal.html'
# adjacent string literals concatenate into one long User-Agent string
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 '
                         'Safari/537.36'}
response = requests.get(url, headers=headers)
(Note that the 'Mozilla' string is actually one long string, split across lines here for formatting purposes.)

You only need to do this if the downloaded page does not contain the article but instead an error message.

Please be reassured that this kind of spoofing is perfectly legal. However, if you have any concern about "faking out" the NY Times server this way, you may use the article file saved as .html instead of downloading a live article from the internet.
If you are otherwise prevented from downloading with requests, perhaps because of a firewall at your organization, you can visit the article (or the one linked above) in a browser and choose File > Save Page As from your browser's menu to save it locally -- save it with a name ending in .html. Then, instead of using requests, you can open the file, use .read() to read it as a string, and pass this string to BeautifulSoup().
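For example, a minimal sketch of the local-file approach (assuming you saved the page as elk.html in your working directory):

from bs4 import BeautifulSoup

# open the locally saved page instead of downloading it with requests
with open('elk.html', encoding='utf-8') as fh:
    page_source = fh.read()                     # the entire file as one string

soup = BeautifulSoup(page_source, 'html.parser')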
For more detail and specific steps, please see the Project Discussion.
Ex. 3.2 | Set up a github.com account and attempt to set up SSH encryption.
Please visit github.com and, if you have not already done so, register with GitHub to create a free account. Next, please follow the instructions in the slide deck entitled Getting Started with github. Attempt to follow the instructions to the letter, as a small misstep may mean you're temporarily prevented from proceeding. Please keep in mind that you may run into issues when attempting to set up your private and public keys; do your best, but please don't sweat it if you run into trouble -- we can work through any kinks when we meet. To submit your work, simply write a message in the submit blank on the homework system indicating whether you were able to complete all steps; if you ran into issues, please give details about the problem (what step you were on, what you're seeing that indicates the issue, and any other information you can provide).
EXTRA CREDIT
Ex. 3.3 | (extra credit / supplementary) Research and use the textwrap module to wrap your article output.
Starting with your completed article scraping assignment, wrap the text of the article to 40 characters using the textwrap module. To do this you'll need to read up on the module and figure out how to use it -- no help from me! This assignment will help get you into the groove of working with a brand-new library using only information that you find online.
Ex. 3.4 | (extra credit / supplementary) Download and parse JSON or CSV data from the Alphavantage website.
The site provides stock price data for a given time period. You'll request data for a particular stock, parse and compile lists of the high, low and close prices, and then calculate the overall high, the overall low, and the standard deviation of the close prices. Sample output:

date: 2020-07-02
High: 121.42
Low: 119.26
Standard Deviation: 0.4310024502052363
To make a request to the Alphavantage server, you must take the following steps (a minimal sketch follows the list):

API Key: Alphavantage requires you to register and use an API key to validate your request. Please visit the Alphavantage website to claim your API key. (You can give any email address; the 16-digit key will appear in the web page -- i.e., you don't have to confirm your email to receive the key.)

On the Website, Select a Report: review the "API Documentation + Examples" link to find the Time Series Intraday report. Make note of the URL query string (the key=val pairs after the question mark). Choose a symbol (for example IBM), an interval (for example 5min) and a datatype (CSV or JSON). Important: you must use the params= argument to .get() or .post() to build your query string; please do not build it as a single string!

Parse the Data: use the csv or json module to parse this string data. Please do not save to a file. For JSON (recommended): instead of saving to a file, pass the string from response.text to json.loads(), which will return the Python object holding the JSON data. For CSV: instead of saving to a file, read the string from response.text, use .splitlines() to split the string into a list of string lines, and pass this list of lines to csv.reader().

Calculate Results: use your collected lists of values to calculate the high, the low and the standard deviation shown in the sample output above.
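Here is a minimal sketch of the request-and-parse steps via the JSON route. The endpoint and parameter names follow Alphavantage's published Intraday examples, and 'YOUR_API_KEY' is a placeholder for your own key:

import json
import requests

# build the query string with params=, not by pasting key=val pairs into the URL
params = {'function': 'TIME_SERIES_INTRADAY',
          'symbol':   'IBM',
          'interval': '5min',
          'apikey':   'YOUR_API_KEY'}
response = requests.get('https://www.alphavantage.co/query', params=params)

data = json.loads(response.text)    # a dict containing the report's time series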
Make sure to read only those rows that match the latest date in the data:
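Continuing the sketch above, here is one way to do the filtering and the calculations. The "Time Series (5min)" key and the numbered field names ('2. high', etc.) are my reading of the Intraday JSON layout, so please verify them against your own response:

import statistics

series = data['Time Series (5min)']     # assumed key; check your response
latest_date = max(series)[:10]          # timestamps sort lexically; keep YYYY-MM-DD

highs, lows, closes = [], [], []
for timestamp, row in series.items():
    if timestamp.startswith(latest_date):       # only the latest date's rows
        highs.append(float(row['2. high']))
        lows.append(float(row['3. low']))
        closes.append(float(row['4. close']))

print('date:', latest_date)
print('High:', max(highs))
print('Low:', min(lows))
print('Standard Deviation:', statistics.stdev(closes))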
Bonus: use matplotlib to visualize the results! The following code will create a line chart:
import matplotlib.pyplot as plt

plt.plot(close_prices_list)     # plot the collected close prices in order
plt.savefig(f'{ticker}.png')    # save the chart as an image file
close_prices_list is the name of the variable I used to collect close prices; ticker is the name of the variable I used to indicate the stock I chose to analyze. If you compare your chart to the chart on a service such as Yahoo Finance or Google Finance, you'll see that your chart is a mirror image of the public chart. Why is this, and how can you fix it? This charting will become much more convenient when we delve into pandas and matplotlib -- stay tuned!