Network programming requires a shift in thinking about how our programs work.
The Client-Server Networking Protocol
When we run a program locally, we expect the program to do some work, print output to the screen, and then (eventually) stop executing. When we run a program that sends a message over a network (a client program), it relies on another program to respond (a server program). This reliance on another program means that many problems may arise in networking, including blocking conditions (both programs are waiting for the other to say something, or both programs are talking at the same time), or situations in which one program is unable to understand the other. The client-server protocol is a simple understanding between two programs: the client sends a request and waits for a reply; the server listens for requests and sends back a response.
HTTP Protocol
In addition to understanding the client-server protocol of sending and listening, both programs must "speak the same language", meaning they need a more detailed protocol in order to understand each other. This protocol defines what the client may say when making a request and what the server may say in response. HTTP (HyperText Transfer Protocol) is a client-server protocol that defines how clients and servers communicate over the internet. It is in use anytime you use a web browser, and it is the reason you see an http:// in front of most URLs. In this session we'll learn how to use a Python program as a client (i.e., to replace the browser), and how to construct and send HTTP requests and read HTTP responses.
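As a rough sketch of this request/response exchange, the example below uses Python's built-in http.client module (rather than the requests module covered later); the host and headers here are only examples:

import http.client

# open a connection to the server and send a minimal GET request "by hand"
conn = http.client.HTTPSConnection('www.python.org')
conn.request('GET', '/', headers={'Accept': 'text/html'})

# read the server's response: status code, headers and body
resp = conn.getresponse()
print(resp.status, resp.reason)             # e.g., 200 OK
print(resp.getheader('Content-Type'))       # e.g., text/html; charset=utf-8
body = resp.read()                          # raw bytes of the response body

conn.close()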
A request may consist of headers, a URL, parameters, and a body.
Parts of an HTTP Request
URL | the address of the server and the resource requested (i.e., a file or program) |
method | (not a Python method) the type of request, usually GET (requesting information from the server) or POST (posting data to the server) |
parameters | key/value pairs that appear with the URL in a query string |
headers | meta information about the request (date and time, the computer and program making the request, what types of images and files the browser can display, etc.), also as key/value pairs; may include a cookie the client sends to identify itself |
body | data being sent to the server, also key/value pairs |
Parts of an HTTP Response
response code | a 3-digit code that indicates whether the request succeeded (200), the resource was not found (404), caused an error (500), etc. |
headers | meta information about the response and computers involved; may include a cookie to identify this user, to be stored by the client |
body | data being returned from the server (HTML, JSON, plain text, etc.)
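To see how these parts fit together, here is a sketch of the raw text that might be exchanged for a simple GET request (all values are illustrative):

GET /cgi-bin/http_reflect?a=1&b=2 HTTP/1.1      <-- method, URL path and query-string parameters
Host: davidbpython.com                          <-- request headers (key/value pairs)
Accept: text/html

HTTP/1.1 200 OK                                 <-- response code
Content-Type: text/html; charset=utf-8          <-- response headers
Content-Length: 1234

<html> ... </html>                              <-- response body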
The server-side program 'http_reflect' will show you the contents of your request.
The link and the two forms below each send a request to my server-side program http_reflect, which simply reflects back the HTTP data (headers, parameters and body) sent to it.
Here is a GET request using a link: http://davidbpython.com/cgi-bin/http_reflect?a=1&b=2
Here is a form that generates a POST request:
HTML source for the form:
<FORM ACTION="http://davidbpython.com/cgi-bin/http_reflect" METHOD="POST">
  "a" parameter: <INPUT name="a"><br>
  "b" parameter: <INPUT name="b"><br>
  <INPUT type="submit" value="send!">
</FORM>
Here is a file upload:
HTML source for the form:
<form enctype="multipart/form-data" action="http://davidbpython.com/cgi-bin/http_reflect" method="post">
  <p style="font-size: 24px">File: <input type="file" name="filename" /></p>
  <p style="font-size: 24px"><input type="submit" value="Upload" /></p>
</form>
The requests module can handle most aspects of HTTP interaction with a server.
Basic Example: Download and Save Data
import requests
url = 'https://www.python.org/dev/peps/pep-0020/' # the Zen of Python (PEP 20)
response = requests.get(url) # a response object
text = response.text # text of response
# writing the response to a local file -
# you can open this file in a browser to see it
wfh = open('pep_20.html', 'w')
wfh.write(text)
wfh.close()
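The same write can also be done with a with block, which closes the file automatically:

# equivalent write using a context manager (the file is closed automatically)
with open('pep_20.html', 'w') as wfh:
    wfh.write(text)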
More Complex Example: Send Headers, Parameters, Body; Receive Status, Headers, Body
import requests
url = 'http://davidbpython.com/cgi-bin/http_reflect' # my reflection program
div_bar = '=' * 10
# headers, parameters and message data to be passed to request
header_dict = { 'Accept': 'text/plain' } # change to 'text/html' for an HTML response
param_dict = { 'key1': 'val1', 'key2': 'val2' }
data_dict = { 'text1': "We're all out of gouda." }
# a GET request (change to .post for a POST request)
response = requests.get(url, headers=header_dict,
                        params=param_dict,
                        data=data_dict)
response_status = response.status_code # integer status of the response (OK, Not Found, etc.)
response_headers = response.headers # headers sent by the server
response_text = response.text # body sent by server
# outputting response elements (status, headers, body)
# response status
print(f'{div_bar} response status {div_bar}\n')
print(response_status)
print(); print()
# response headers
print(f'{div_bar} response headers {div_bar}\n')
for key in response_headers:
    print(f'{key}: {response_headers[key]}\n')
print()
# response body
print(f'{div_bar} response body {div_bar}\n')
print(response_text)
Note that if import requests raises a ModuleNotFoundError exception, requests must be installed. It is not included with the Standard Distribution from python.org.
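It can usually be installed from the command line, for example:

pip install requests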
Specific techniques for reading the most common data formats.
CSV: feed the string response to .splitlines(), then to csv.reader:
import requests
import csv
url = 'path to csv file'
response = requests.get(url)
text = response.text
lines = text.splitlines()
reader = csv.reader(lines)
for row in reader:
    print(row)
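If the CSV file has a header row, csv.DictReader can be used the same way (the column names below are hypothetical):

import requests
import csv

url = 'path to csv file'

response = requests.get(url)
lines = response.text.splitlines()

# DictReader uses the header row as the keys of each row dict
reader = csv.DictReader(lines)
for row in reader:
    print(row['name'], row['price'])       # hypothetical column names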
JSON: requests provides built-in support through the .json() method:
import requests
url = 'path to json file'
response = requests.get(url)
obj = response.json()
print(type(obj)) # <class 'dict'>
The status code '200' means OK, but other codes may mean an error.
Every HTTP response is expected to return a 3-digit status code. These include codes such as 204 (No Content: there is no data in the response) and 401 (Unauthorized: you do not have the privileges to see this page). A full list of status codes is available online.
import requests
url = 'https://www.python.org/dev/peps/pep-0020/' # the Zen of Python (PEP 20)
response = requests.get(url) # a response object
code = response.status_code # 200
print(requests.status_codes._codes[code]) # ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓')
print(requests.status_codes._codes[500]) # ('internal_server_error', 'server_error', '/o\\', '✗')
In many cases we just want to know whether the request succeeded. As there are many response codes, some of which mean success and some failure, requests can be made to raise an exception if a 'failure' code was received:
import requests
response = requests.get('http://www.yahoo.com/some/wrong/url')
response.raise_for_status()
# raise HTTPError(http_error_msg, response=self)
# requests.exceptions.HTTPError: 404 Client Error:
# Not Found for url: http://yahoo.com/some/wrong/url
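A common pattern (a sketch, not part of the example above) is to wrap the call in a try/except so the program can respond to a failed request:

import requests

try:
    response = requests.get('http://www.yahoo.com/some/wrong/url')
    response.raise_for_status()                    # raises HTTPError on a 'failure' status
except requests.exceptions.HTTPError as e:
    print(f'request failed: {e}')
else:
    print(f'request succeeded: {response.status_code}')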
Data is returned from a server as bytes; requests can decode most plaintext correctly.
Note: if this discussion of encoding is not immediately clear, see the "Understanding Unicode and Character Encodings" slide deck. All plaintext (i.e., the characters we see in files such as .txt, .csv, .json, .html, .xml, etc.) is encoded as integers (called 'bytes' in this context). Bytes are decoded to characters using an encoding. There are many possible encodings on the internet. Many HTML documents use the 'charset' value in the 'Content-Type' header to specify the encoding. If this value is not present, requests uses the chardet library to "sniff" the correct encoding.
requests attempts to handle encoding seamlessly through its .text attribute:
import requests
url = 'http://davidbpython.com/cgi-bin/http_reflect'
r = requests.get(url)
print(r.encoding) # 'utf-8' (specified in the response's Content-Type header)
print(r.text) # requests uses this encoding to decode the text
print(r.apparent_encoding) # 'ascii' (this is what it looks like to requests)
r.encoding = 'utf-16' # force requests to use a different encoding
print(r.text) # oops, wrong encoding:
# '⨪⨪\u202a呈偔删䙅䕌呃⨠⨪⨪琊楨\u2073牰杯慲\u206d敲汦捥獴琠敨攠敬敭瑮\u2073景琠敨䠠呔⁐敲畱獥⁴琊慨⁴慷...
Keep in mind that requests almost always handles encodings correctly; cases in which we have to set the encoding ourselves are rare.
To download raw bytes (for example, images or sound files), we use the response.content attribute and write the data to a file opened in binary mode:
import requests
url = 'https://davidbpython.com/advanced_python/supplementary/python.png' # a URL to an image
response = requests.get(url) # a response object
image_bytes = response.content # response as bytes
print(f'{len(image_bytes)} bytes') # 90835 bytes
wfh = open('python.png', 'wb') # preparing a file to receive bytes
wfh.write(image_bytes)
wfh.close()
We can pass raw bytes to requests to upload a file.
Keep in mind that you cannot upload a file to an arbitrary directory on a server - file upload is designed to send files to applications that are ready to receive them.
import requests

url = 'https://davidbpython.com/cgi-bin/http_reflect'

# open the file in binary mode ('rb'), so reads return bytes rather than str
file_obj = open('../test_file.txt', 'rb')

response = requests.post(url, files={'file': file_obj})

print(response.text)
print(response.status_code)                  # 200 (if all is well)

# to specify a filename and mime type explicitly, pass a tuple instead:
# files={ 'file': ('test_file.txt', file_obj, 'text/plain') }
text/plain is a MIME type and denotes that we are uploading a simple text file. Other types include text/csv, text/html, text/xml, image/jpeg, application/json, application/zip, application/vnd.ms-excel and application/octet-stream (the default for non-text files). See Common Mime Types.
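As a sketch of how a mime type might be declared explicitly (the filename here is hypothetical), the tuple form of the files argument takes the type as its third element:

import requests

url = 'https://davidbpython.com/cgi-bin/http_reflect'

# upload a CSV file, declaring its mime type explicitly (hypothetical filename)
with open('scores.csv', 'rb') as file_obj:
    response = requests.post(url, files={'file': ('scores.csv', file_obj, 'text/csv')})

print(response.status_code)                  # 200 (if all is well)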
For those who cannot install requests, urllib is available.
Although the requests module is strongly favored by some for its simplicity, it has not yet been added to the Python standard library.
The urlopen method takes a url and returns a file-like object that can be read() as a file:
import urllib.request
my_url = 'http://www.yahoo.com'
readobj = urllib.request.urlopen(my_url)
text = readobj.read()
print(text)
readobj.close()
Alternatively, you can call readlines() on the object (keep in mind that many objects that deliver file-like string output can be read with this same-named method):
for line in readobj.readlines():
    print(line)
readobj.close()
The text that is downloaded may be CSV, HTML, Javascript, or other kinds of data.
TypeError: can't use a string pattern on a bytes-like object
This error may occur with some websites. It indicates that the response was received as undecoded bytes.
The response usually comes to us as a special object called a byte string. In order to work with the response as a string, we may need to use the decode() method:
text = text.decode('utf-8')
UnicodeEncodeError
This error may occur if the downloaded page contains characters that Python doesn't know how to handle. In most cases it is fixed by applying the text.decode line above to the text immediately after it is retrieved from urlopen().
SSL Certificate Error
Many websites enable SSL security and require a web request to accept and validate an SSL certificate (certifying the identity of the server). urllib by default requires SSL certificate security, but it can be bypassed (keep in mind that this may be a security risk):
import ssl
import urllib.request
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
my_url = 'https://www.nytimes.com'
readobj = urllib.request.urlopen(my_url, context=ctx)
Download binary files: images and other files can be saved locally using urllib.request.urlretrieve().
import urllib.request
urllib.request.urlretrieve('http://www.azquotes.com/picture-quotes/quote-python-is-an-experiment-in-how-much-freedom-programmers-need-too-much-freedom-and-nobody-guido-van-rossum-133-51-31.jpg', 'guido.jpg')
Note the two arguments to urlretrieve(): the first is a URL to an image, and the second is a filename -- this file will be saved locally under that name.
When including parameters in our requests, we must encode them into our request URL. The urlencode() method does this nicely:
import urllib.request, urllib.parse
params = urllib.parse.urlencode({'choice1': 'spam and eggs', 'choice2': 'spam, spam, bacon and spam'})
print("encoded query string: ", params)
f = urllib.request.urlopen("http://davidbpython.com/cgi-bin/http_reflect?{}".format(params))
print(f.read())
this prints:
encoded query string:  choice1=spam+and+eggs&choice2=spam%2C+spam%2C+bacon+and+spam
choice1: spam and eggs<BR>
choice2: spam, spam, bacon and spam<BR>