Advanced Python
Project Discussion, Session 3
Ex. 3.1 | Scrape an Article from the New York Times. |
Step 1: Find a NY Times article and save the source.

Visit the New York Times website and select an article of interest to you. (Make sure it's a regular article with paragraphs, not a multimedia feature.) Save this page locally on your computer with File > Save As, using an .html extension. From your browser, open the file with File > Open File.... You should see the same page displayed. Scroll down to see the article paragraphs. If the browser will not let you view the article paragraphs, you can register for a free NY Times account (which gives you access to several articles for free), OR you can use the following saved article: http://davidbpython.com/classfiles/elk.html (Note: it isn't necessary to save the file, although it's important to know that you can -- you might do it to work offline, to open the source in a text editor to do text searches more conveniently, etc.)

Step 2: Use requests to download the article.

First, assign the remote URL starting with https:// (not the file URL starting with file://) as a string variable in your program. Now use response = requests.get() to download the article, and response.text to retrieve the decoded string of the response. Check the length of this string -- it should be very large, over 100,000 characters.

Step 3: Pass the text to Beautiful Soup and look at the page title.

Once you have passed the text in response.text to BeautifulSoup to create the soup object, you can print soup.title to view the title of the page: you should see a <title> tag containing the same text as that shown in your browser's window or window tab. (However, please don't use this tag for your project's output -- we will prefer to use a <meta> tag.)

Step 4: View the article in a browser and look for the desired elements.

As mentioned in the assignment, we will scrape the following information from the article (a minimal sketch of Steps 2 and 3 follows the list below):
- the article title (from a <meta> tag)
- the byline / author names (from a <meta> tag)
- the publication date (from a <meta> tag)
- the article paragraphs (from <p> tags)
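Here is a minimal sketch of Steps 2 and 3, using the saved elk article's remote URL as an example (substitute the https:// URL of your own chosen article):

import requests
from bs4 import BeautifulSoup

url = 'http://davidbpython.com/classfiles/elk.html'   # or your own article's https:// URL

response = requests.get(url)                 # download the page
html = response.text                         # decoded string of the response
print(len(html))                             # should be very large (100,000+ characters)

soup = BeautifulSoup(html, 'html.parser')    # parse the HTML into a soup object
print(soup.title)                            # the <title> tag (for inspection only, not project output)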
Viewing the page in the browser, look for each of these elements, noting down the specific text of each (for example, the title text, the author's name and the start of the first few paragraphs).

Step 5: View the article page source and search for the title, author and date.

To view the source, you can either right-click in the browser and choose View Page Source, or you can use a text editor to open the .html page you saved earlier. In the view-source window or text editor, do a text search for the title of the article. You don't have to search for the whole title -- you can type in just the first few words, enough to find the title within the page. If you keep searching you'll find that the title appears in multiple places within the page. Please keep searching for the title text until you see it within a <meta> tag -- we will scrape the title from this tag. The reason we prefer this tag is that <meta> information is intended to be scraped by outside parties, so it is more reliable. Facebook, Twitter, news feeds and others may be interested in displaying the title, authors, date, etc. on their own websites, and so the publisher provides this information specifically so it can be retrieved by scrapers. Copy the entire <meta> tag out of the source and paste it into your program as a comment. You will need to see an example of each tag in order to scrape it, so it will be convenient to save it in your program. Search for the rest of the meta information and copy and paste the entire <meta> tag for each of title, author and date into your code as a comment. Note that the date found in a <meta> tag will not be formatted as it is for display.

Step 6: Use soup.find() and tag.get() to retrieve the tag, and the data from the tag.

Begin by searching for the first meta tag that you identified. You'll use soup.find() for this purpose. Of course the arguments to this method must specify the name of the tag and its identifying parameter(s). You can verify that you have the right tag by printing the Tag object that is returned from .find(). This should display the whole tag. (Keep in mind that this is a Tag object, not a string -- but when you print the object it does display the original tag as it appears in the page source.) Now use dict subscripting or .get() on the tag to retrieve the information you need from the tag. Repeat for each of your <meta> tags. Please do not use the content of a tag to find the tag -- the assumption is that you don't know the content and are trying to extract it; you only know the tag name and parameters.
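As a sketch of Step 6, assuming the title <meta> tag uses the property= and content= parameters shown below (this example tag is illustrative -- use the actual tag and parameter names you copied from your own page source):

# example tag pasted from the page source as a comment:
# <meta property="og:title" content="Elk Return to Kentucky, Bringing Economic Life"/>

title_tag = soup.find('meta', attrs={'property': 'og:title'})   # soup comes from Step 3
print(title_tag)                          # prints the whole Tag object, as it appears in the source

title = title_tag.get('content')          # or: title_tag['content']
print(title)

The byline and date tags are found and read the same way, each with its own identifying parameter(s).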
Step 7: View the article page source and search for the article paragraphs.

Most websites place text paragraphs in <p> or "paragraph" tags. In the page source, search for the 3rd article paragraph (we search for a "middle" paragraph because the initial ones may appear as meta information in the page -- we want the article itself) and copy and paste its opening <p> tag, including its parameters, into your code as a comment. <p> tags are "container" tags, which means that they have text between an open tag (<p>) and a close tag (</p>). To find the remaining article paragraphs, do a search in the page source for the start of the <p> tag, including its uniquely identifying parameter(s). If you search for this tag correctly, you'll see multiple results, and if you step through each result you should see that each <p> tag contains the next paragraph in the article -- you can confirm that you're seeing each paragraph by comparing to the browser display of the article.

You can further confirm that you've identified each article paragraph by visually counting the article paragraphs in the browser view and comparing them to the number of search results in your source view when you search for the <p> tag's identifying parameters. This count may not match perfectly, because some paragraphs may hold links to other articles, etc., but it will probably be close to the number of paragraphs you see in the browser view. However, be certain that you're finding only the article paragraphs, and not every <p> tag -- there are 7 <p> tags in my sample article that are not article paragraphs.

Step 8: Use soup.find_all() to retrieve a list of all paragraph tags.

Again, keep in mind that not all <p> tags will hold article paragraphs -- this is why we will use one or more parameters to identify just the article paragraphs. You'll see the identifying parameter(s) in the page source when you search for article text.

Step 9: Print the scraped text in a tidy format.

Now you'll pull the data from each meta tag and each of the paragraph tags. The meta tags store the info in parameters, so you'll use dict subscripting or .get() on the tag to get parameter values. The <p> tags have the info between the open and close tags, so you'll use the .text attribute to get the paragraph text. A sketch of Steps 8 and 9 appears below.
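Here is a sketch of Steps 8 and 9, assuming the article paragraphs are identified by a class parameter (the class name below is hypothetical -- substitute the identifying parameter(s) you found in your own page source):

# example opening tag pasted from the page source as a comment:
# <p class="article-body-text">On a bright morning early this spring, David Ledford sat ...

paragraphs = soup.find_all('p', attrs={'class': 'article-body-text'})

print('Title:', title)                    # title, byline and date come from the meta tags in Step 6
# byline and date are printed the same way

for p_tag in paragraphs:
    print(p_tag.text)                     # the text between the open and close tags
    print()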
As shown in the assignment, please have your program print out the title, byline, date and article paragraphs in plaintext:
Title: Elk Return to Kentucky, Bringing Economic Life
Byline: By Oliver Whang and Morgan Hornsby
Date: 2020-06-30

On a bright morning early this spring, David Ledford sat in his silver pickup at the end of a three-lane bridge spanning a deep gorge in southeast Kentucky. The bridge, which forks off U.S. 119, was constructed in 1998 by former Gov. Paul E. Patton for $6 million. It was seen at the time as a route to many things: a highway, a strip mall, housing...

[ 31 text paragraphs total ]
For date, you can simply slice the longer date to 10 characters to display it in the above format.
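For example, assuming the value read from the date <meta> tag is a longer ISO-style timestamp (the exact value will depend on your article):

date = '2020-06-30T05:00:18-04:00'        # assumed value read from the date <meta> tag
print('Date:', date[:10])                 # Date: 2020-06-30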
Ex. 3.2 | Set up a github.com account and attempt to set up SSH key authentication. |
The details are in the Getting Started with github slide deck; as noted in the assignment, please don't sweat it if you run into an issue. Simply follow the directions as best you can, and if you can't get all the way there, just note the details in the homework submission blank. If you did complete the steps successfully, please say so in the homework submission as well.
EXTRA CREDIT
Ex. 3.3 | (extra credit / supplementary) Research and use the textwrap module to wrap your article output. |

As noted in the assignment, no help from me! But keep in mind that Python documentation can be sometimes easy, sometimes hard to work from, and you may find a handy, clearly written blog post or tutorial online -- in fact, it is extremely likely that you will if you look. Don't hesitate to try a couple of different sources to find the information you need as quickly as possible. We usually don't have a lot of time for flowery introductions or highly detailed descriptions -- for me personally, in a professional situation it's usually a race to find a clear example that pertains directly to my situation. Good luck!
Ex. 3.4 | (extra credit / supplementary) Download and parse JSON or CSV data from the Alphavantage website. |
Step 1: Get an API key from Alphavantage.

Visit the Alphavantage website at alphavantage.co and click on 'Get your free API key today'. The website asks for your email address, but as far as I can tell it doesn't validate the address. As soon as you submit, the page will display your API key. Save this key and store it as a string variable in your program -- it will be added to the server request.

Step 2: Find the report you'd like to download.

Back on the Alphavantage main page, click on "API Documentation and Examples". The first example is TIME_SERIES_INTRADAY, part of the Stock Time Series section. The parameters are listed with explanations, and below that you'll see 2 sample URLs.

Step 3: Note down the parameters you intend to use to configure the report.

Note that the parameters shown in the URLs (formatted as key=val&key=val and known as the query string) are the same ones listed in the description above. To request time series data you must include the required parameter keys, with values reflecting the data you'd like to retrieve. However, please do not build the query string yourself. You are required to ask requests to build the query string by passing a dict of key/value pairs to the params= argument of requests.get().

Step 4: In your program, build a dict of key/value parameters.

You can hard-code the dict values in your program, or if you wish you can specify some of the values using arguments to the program, for example the symbol (stock ticker) and interval (1min, 5min, etc.).

Step 5: Call the Alphavantage query URL, passing the parameters dict as the params= argument to requests.get().

The response object returned from this call will be used to read the info in the response.

Step 6: Retrieve the data and read it using the csv or json modules.

Please keep in mind that the CSV or JSON data does not need to be saved to a file -- and should not be. Parse the data as follows. For CSV: csv.reader() will accept any iterable that delivers a string line with each iteration, including a list of string lines. So retrieve the string response with response.text, use .splitlines() to split the string into a list of string lines, and then pass that list to csv.reader(). For JSON: json can read the string response with json.loads() -- there is no need to save to a file.

Step 7: Establish lists to hold high, low and close prices, and make note of the date in the first row.

I began by initializing 3 lists, assigned to the variable names high, low and close. The intention is to use the "summary" functions max(), min() and statistics.stdev() to generate the final output. The data will cover more than one date, but we would like to see just the data for the latest date. Since the data is returned in reverse-chronological order, the date in the first record can be used to identify all records with the latest date. For CSV, I used next() twice on the reader object -- the first row is the header, and the 2nd is the first line of data. I appended the high, low and close values from that row to their respective lists, then retrieved the date from the row so I could filter the remaining rows that match it. For JSON, I was able to retrieve the latest date from the "3. Last Refreshed" value in the dict paired with "Meta Data" in the overall dict. (A sketch of Steps 4 through 7 for the CSV case appears below.)
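Here is a minimal sketch of Steps 4 through 7 for the CSV case, assuming the TIME_SERIES_INTRADAY parameters described in the Alphavantage documentation (the API key and ticker symbol are placeholders, and the column positions assumed in the comments should be checked against your own header row):

import csv
import requests

API_KEY = 'YOUR_API_KEY'                  # placeholder -- use the key Alphavantage issued to you

params = {
    'function': 'TIME_SERIES_INTRADAY',
    'symbol':   'IBM',                    # placeholder ticker
    'interval': '5min',
    'datatype': 'csv',                    # omit (or use 'json') for a JSON response
    'apikey':   API_KEY,
}

# requests builds the key=val&key=val query string from the params dict
response = requests.get('https://www.alphavantage.co/query', params=params)

lines = response.text.splitlines()        # split the response string into a list of string lines
reader = csv.reader(lines)                # csv.reader() accepts any iterable of string lines

high, low, close = [], [], []             # lists for the 'high', 'low' and 'close' prices

header = next(reader)                     # first row is the header
first_row = next(reader)                  # 2nd row is the first (latest) line of data

latest_date = first_row[0][:10]           # assumed timestamp column, e.g. '2024-01-05 19:55:00'
high.append(float(first_row[2]))          # assumed column order: timestamp, open, high, low, close, volume
low.append(float(first_row[3]))
close.append(float(first_row[4]))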
Step 8: Reading only those records that match the latest date, build 3 lists of values from the 'high', 'low' and 'close' fields.

Loop through the data and retrieve these 3 values from each record, making sure to work with only those rows that match the latest date saved earlier. For CSV, loop through and subscript the date for each row; if the date field matches the latest date, subscript the 3 price values and append them to their respective lists. For JSON, loop through the datetime keys in the dict paired with "Time Series (5min)"; if the date matches the latest date, retrieve each "inner" dict, subscript the 3 price values and append them to their respective lists.

Step 9: Use the 'summary' functions with the list arguments to determine the max value from the high prices, the min value from the low prices and the standard deviation of the close prices.

Step 10: Plot a chart!

To see the chart, look for a new file added to the directory where your script has been saved. (A sketch of Steps 8 through 10 continues below.)
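Continuing the CSV sketch from above, Steps 8 through 10 might look like this (the plotting lines assume matplotlib is installed; the assignment may intend a different charting approach):

import statistics
import matplotlib.pyplot as plt

# reader, latest_date and the high/low/close lists come from the sketch above

for row in reader:                        # remaining data rows
    if row[0][:10] == latest_date:        # keep only rows from the latest date
        high.append(float(row[2]))
        low.append(float(row[3]))
        close.append(float(row[4]))

print('high: ', max(high))                # highest 'high' price for the day
print('low:  ', min(low))                 # lowest 'low' price for the day
print('stdev:', statistics.stdev(close))  # standard deviation of the 'close' prices

plt.plot(close)                           # note: values are in reverse-chronological order
plt.savefig('close_prices.png')           # the chart file appears in the script's directory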
If you compare your chart to the stock price chart on Yahoo Finance or Google Finance, you'll see that your line is a left-to-right mirror image of theirs. This is because the Alphavantage data is presented in reverse chronological order. To reverse the values, you can use a list slice 'step' value of -1:
rclose = close_prices[::-1]
This slice says "give me all values, but with a step of -1" -- in other words, reverse the order of the items.
An alternative way to reverse a list is to use reversed():
rclose = list(reversed(close_prices))
As this function returns an iterator (which can be used in a for loop), it must be passed to list() to produce a list. This charting will become much more convenient when we delve into pandas and matplotlib -- stay tuned!