Advanced Python
Projects, Session 3
Please note: notations (as used in the Introduction to Python course) are no longer required.
Ex. 3.1 | Scrape an Article from the New York Times.
In this assignment you will select an article from the New York Times (or choose the one linked below), use requests to download the article's HTML page source, and then scrape relevant information from the article: the title, the byline (author), the date, and the text of the article's paragraphs.
If you can't find a meta tag that has the info (title, author, date) you want, you may use other tags, but please look carefully for it. Please do not use the content of a tag to find the tag. The assumption is that you don't know the content and are trying to extract it -- you only know the tag name and its attributes.
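For orientation, here is a minimal sketch of looking up a meta tag with BeautifulSoup. The 'og:title' property and the sample HTML are purely illustrative -- inspect the actual page source to find the tags that carry the title, author and date:

from bs4 import BeautifulSoup

# page_source stands in for the HTML string you download with requests
page_source = '<html><head><meta property="og:title" content="Sample Title"></head></html>'
soup = BeautifulSoup(page_source, 'html.parser')

# find the tag by its name and attributes, never by its content
title_tag = soup.find('meta', attrs={'property': 'og:title'})
if title_tag is not None:
    print('Title:', title_tag.get('content'))   # meta tags carry their value in 'content'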
The output of the program should be the text elements printed in a readable format, for example:
Title: Elk Return to Kentucky, Bringing Economic Life
Byline: By Oliver Whang and Morgan Hornsby
Date: 2020-06-30

On a bright morning early this spring, David Ledford sat in his silver pickup at the end of a three-lane bridge spanning a deep gorge in southeast Kentucky. The bridge, which forks off U.S. 119, was constructed in 1998 by former Gov. Paul E. Patton for $6 million. It was seen at the time as a route to many things: a highway, a strip mall, housing...

[article continues... please output entire article]

31 text paragraphs total
(Please note that the article is abbreviated above just to save space -- your program should print out the entire article.) Also note that the article paragraphs are in <p> tags, but not every such tag holds an article paragraph -- you must be selective, not just pull all <p> tags.

Sample article link -- use this or find one that interests you:

https://www.nytimes.com/2020/06/30/science/kentucky-elk-wildlife-coal.html

Please keep in mind that although the NYTimes may display a "please register" banner when you view an article, you may still be able to view, download and save the source and its content. If you're unable to view the article in a browser, you can find a viewable copy of the above article here:

http://davidbpython.com/classfiles/elk.html

If you run into the NYTimes free article limit, you can use a different browser, clear your cookies, or use requests to download an article and save it to your hard drive, opening it from your browser locally using File > Open File....

If your requests.get() call is not returning the article page but what looks like a different HTML page, please look at the text of the response closely. You may see that the full text of your response is relatively brief and includes text such as "Please enable javascript". This means that the Times server has detected that you are not actually a browser and has decided to serve you bupkus (i.e., nothing) for your troubles. (They do this because, of course, they want their articles consumed in a browser, where images and ads can be displayed.) To resolve this issue, you can add a "spoof header" to requests.get() which will tell the NY Times server that you're a browser and thus should not be subject to the above restriction:
import requests

url = 'https://www.nytimes.com/2020/06/30/science/kentucky-elk-wildlife-coal.html'
# adjacent string literals concatenate into one long User-Agent string
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 '
                         'Safari/537.36'}
response = requests.get(url, headers=headers)
(Note that the 'Mozilla' string is actually one long string, split across lines here for formatting purposes.)

You only need to do this if the downloaded page does not contain the article but instead an error message.

Please be reassured that this kind of spoofing is perfectly legal. However, if you have any concern about "faking out" the NY Times server this way, you may use the article file saved as .html instead of downloading a live article from the internet.
If you are otherwise prevented from downloading with requests, perhaps because of a firewall at your organization, you can visit the article (or the one linked above) in a browser and choose File > Save Page As from your browser's menu to save it locally -- save it with a name ending in .html. Then, instead of using requests, you can open the file, use .read() to read it as a string, and pass this string to BeautifulSoup().
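For example, a minimal sketch of the local-file approach (assuming you saved the page as elk.html in your working directory):

from bs4 import BeautifulSoup

# open the locally saved page instead of downloading it with requests
with open('elk.html', encoding='utf-8') as fh:
    page_source = fh.read()                     # the entire file as one string

soup = BeautifulSoup(page_source, 'html.parser')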
For more detail and specific steps, please see the Project Discussion.
Ex. 3.2 | Set up a github.com account and attempt to set up SSH encryption.
Please visit github.com and, if you have not already done so, register with GitHub to create a free account. Next, please follow the instructions in the slide deck entitled Getting Started with github. Attempt to follow the instructions to the letter, as a small misstep may mean you're temporarily prevented from proceeding. Please keep in mind that you may run into issues when attempting to set up your private and public keys; do your best, but please don't sweat it if you run into trouble -- we can work through any kinks when we meet. To submit your work, simply write a message in the submit blank on the homework system indicating whether you were able to complete all steps; if you ran into issues, please give details about the problem (what step you were on, what you're seeing that indicates the issue, and any other information you can provide).
EXTRA CREDIT
Ex. 3.3 | (extra credit / supplementary) Research and use the textwrap module to wrap your article output.
Starting with your completed article scraping assignment, wrap the text of the article to 40 characters using the textwrap module. To do this you'll need to read up on the module and figure out how to use it -- no help from me! This assignment will help get you into the groove of working with a brand-new library using only information that you find online.
Ex. 3.4 | (extra credit / supplementary) Download and parse JSON or CSV data from the Alphavantage website.
The site provides stock price data for a given time period. You'll request data for a particular stock, parse and compile lists of the high, low and close prices, and then calculate the overall high, the overall low, and the standard deviation of the close prices. Sample output:

date: 2020-07-02
High: 121.42
Low: 119.26
Standard Deviation: 0.4310024502052363
To make a request to the Alphavantage server, you must take the following steps (a minimal sketch follows the list):

API Key: Alphavantage requires you to register and use an API key to validate your request. Please visit the Alphavantage website to claim your API key. (You can give any email address; the 16-digit key will appear in the web page -- i.e., you don't have to confirm your email to receive the key.)

On the Website, Select a Report: review the "API Documentation + Examples" link to find the Time Series Intraday report. Make note of the URL query string (the key=val pairs after the question mark). Choose a symbol (for example IBM), an interval (for example 5min) and a datatype (CSV or JSON). Important: you must use the params= argument to .get() or .post() to build your query string; please do not build it as a single string!

Parse the Data: use the csv or json module to parse this string data. Please do not save to a file. For JSON (recommended): instead of saving to a file, pass the string from response.text to json.loads(), which will return the Python object holding the JSON data. For CSV: instead of saving to a file, read the string from response.text, use .splitlines() to split the string into a list of string lines, and pass this list of lines to csv.reader().

Calculate Results: use your collected lists of values to calculate the high, the low and the standard deviation shown in the sample output above.
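Here is a minimal sketch of the request-and-parse steps via the JSON route. The endpoint and parameter names follow Alphavantage's published Intraday examples, and 'YOUR_API_KEY' is a placeholder for your own key:

import json
import requests

# build the query string with params=, not by pasting key=val pairs into the URL
params = {'function': 'TIME_SERIES_INTRADAY',
          'symbol':   'IBM',
          'interval': '5min',
          'apikey':   'YOUR_API_KEY'}
response = requests.get('https://www.alphavantage.co/query', params=params)

data = json.loads(response.text)    # a dict containing the report's time series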
Make sure to read only those rows that match the latest date in the data:
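Continuing the sketch above, here is one way to do the filtering and the calculations. The "Time Series (5min)" key and the numbered field names ('2. high', etc.) are my reading of the Intraday JSON layout, so please verify them against your own response:

import statistics

series = data['Time Series (5min)']     # assumed key; check your response
latest_date = max(series)[:10]          # timestamps sort lexically; keep YYYY-MM-DD

highs, lows, closes = [], [], []
for timestamp, row in series.items():
    if timestamp.startswith(latest_date):       # only the latest date's rows
        highs.append(float(row['2. high']))
        lows.append(float(row['3. low']))
        closes.append(float(row['4. close']))

print('date:', latest_date)
print('High:', max(highs))
print('Low:', min(lows))
print('Standard Deviation:', statistics.stdev(closes))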
Bonus: use matplotlib to visualize the results! The following code will create a line chart:
import matplotlib.pyplot as plt

plt.plot(close_prices_list)     # plot the collected close prices in order
plt.savefig(f'{ticker}.png')    # save the chart as an image file
close_prices_list is the name of the variable I used to collect close prices; ticker is the name of the variable I used to indicate the stock I chose to analyze. If you compare your chart to the chart on a service such as Yahoo Finance or Google Finance, you'll see that your chart is a mirror image of the public chart. Why is this, and how can you fix it? This charting will become much more convenient when we delve into pandas and matplotlib -- stay tuned!