Python 3

All Slides on One Page




Variable Assignment; Object and Attribute Introspection

Variables Are Names Bound to Objects

All variable assignment is done dynamically


Basic variable assignment

x = 5                            # int
y = [1, 2, 3]                    # list
z = {'a': 1, 'b': 2}             # dict

Multi-target assignment

x, y, z = 5, 6, 7

Assignment through function arguments

def do(a, b, c):
    print(a, b, c)      # 5 6 7

x = 5
y = 6
z = 7

do(x, y, z)

Assignment through return values

def do():
    return 5, 6, 7

x, y, z = do()
print(x, y, z)       # 5 6 7

Assignment through 'for' looping

x = [1, 2, 3]
for item in x:
    print(item)      # 1, then 2, then 3 (at each iteration of loop)

Dynamic Assignment

A name is just a label; it is not tied to any particular object or type, and can be freely rebound


x = 5
x = 'hello'
x = ['a', 'b', 'c']

print(x)               # ['a', 'b', 'c']
del x                 # "undo" assignment (dissolve binding)

print(x)               # NameError:  name 'x' is not defined

Variables are Names Bound to Objects by Reference

Every assignment is assigning a pointer or reference to the object, not the object itself


x = ['a', 'b', 'c']

y = x                # reference assignment, not object copying

x.append('d')

print(x)              # ['a', 'b', 'c', 'd']
print(y)              # ['a', 'b', 'c', 'd']
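
A quick way to confirm that two names point at the same object (a sketch) is the is operator, which compares object identity rather than value:

```python
x = ['a', 'b', 'c']
y = x                   # both names now refer to the same list object

print(y is x)           # True: same object, not a copy
print(id(x) == id(y))   # True: identical object ids
```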

Object Attribute Introspection

Every object has attributes, and each attribute points to another object (a string, a function/method, etc.)


As a general computer science term, Introspection means an object revealing information about itself when asked. Python supports a great deal of introspection.


An object's attributes are accessed using object.attribute syntax:

var = 'hello'

print(var.upper)        # <built-in method upper of str object at 0x...>

Python indicates that attribute upper is a method of the string var.


Attributes are Sometimes Methods

Every method is an attribute, but not all attributes are methods.


We can see a list of attributes accessible to an object using the dir() function.

var = 'hello'

print(dir(var))

This prints:

['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__',
 '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__',
 '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__',
 '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__',
 '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__',
 '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize',
 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find',
 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit',
 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle',
 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition',
 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip',
 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate',
 'upper', 'zfill']
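
Most of the names in that listing are "dunder" (double-underscore) attributes. To focus on the ordinary methods, we might filter them out; this sketch uses a list comprehension (a technique covered later in the course):

```python
var = 'hello'

# keep only the names that don't start with a double underscore
public_names = [name for name in dir(var) if not name.startswith('__')]

print(public_names[:5])   # ['capitalize', 'casefold', 'center', 'count', 'encode']
```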

help() and the __doc__ attribute

To see documentation on any object or attribute, use help(object)


x = 'hello'

help(x.upper)           # did not call upper(), but referred to it

             # Help on method_descriptor:

             # upper(...)
             #     S.upper() -> str

             #     Return a copy of S converted to uppercase.

The 'docstring' for any object is actually stored in a "magic" attribute called __doc__:

print(x.upper.__doc__)
        # S.upper() -> str

        # Return a copy of S converted to uppercase.

The Interactive help> utility

Use command help() at the Python prompt to find help on any Python feature.


There are a number of features of Python that don't have built-in docstrings, for example the del statement and the in operator. For these we can use the interactive help utility by calling help() directly from within a Python prompt.


launching the Python interactive prompt:

$ python
Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

launching the help utility:

>>> help()

Welcome to Python 3.6's help utility!

If this is your first time using Python, you should definitely check out
the tutorial on the Internet at http://docs.python.org/3.6/tutorial/.

Enter the name of any module, keyword, or topic to get help on writing
Python programs and using Python modules.  To quit this help utility and
return to the interpreter, just type "quit".

To get a list of available modules, keywords, symbols, or topics, type
"modules", "keywords", "symbols", or "topics".  Each module also comes
with a one-line summary of what it does; to list the modules whose name
or summary contain a given string such as "spam", type "modules spam".

help>



Files, Directories and Command Line Arguments

Introduction: Files, Directories and Command Line Arguments

This unit is about reading from and writing to files, listing directories, and working from the command line.


This unit is about the "outside world" of your computer's operating system: its files and directories, the other programs running on it, and the command line, which is the prompt from which we run our Python programs. So far we have read data from a single known location, but often we must marshal data from many locations in our filesystem, and we may also need to search for those files. We also sometimes need to read data produced by other programs residing on our filesystem. Our Python scripts can run Unix or Windows utilities, installed programs, and even other Python programs from within a running script, and capture their output.



Summary structure: sys.argv

sys.argv is a list that holds string arguments entered at the command line


sys.argv example


a python script myscript.py

import sys                           # import the 'system' library

print('first arg: ' + sys.argv[1])   # print first command line arg
print('second arg: ' + sys.argv[2])  # print second command line arg

running the script from the command line

$ python myscript.py hello there
first arg: hello
second arg: there

sys.argv is a list that is automatically provided by the sys module. It contains any string arguments to the program that were entered at the command line by the user. If the user does not type arguments at the command line, then none are added to the sys.argv list.

sys.argv[0]

sys.argv[0] always contains the name of the program itself. Even if no arguments are passed at the command line, sys.argv always holds one value: a string containing the program name (or more precisely, the pathname used to invoke the script).

example runs


a python script myscript2.py

import sys                            # import the 'system' library

print(sys.argv)

running the script from the command line (passing 3 arguments)

$ python myscript2.py hello there budgie
['myscript2.py', 'hello', 'there', 'budgie']

running the script from the command line (passing no arguments)

$ python myscript2.py
['myscript2.py']

Summary Exception: IndexError with sys.argv (when user passes no argument)

An IndexError occurs when we ask for a list index that doesn't exist. Indexing sys.argv raises this error when the user passes fewer arguments than the script expects.


a python script addtwo.py

import sys                            # import the 'system' library

firstint = int(sys.argv[1])
secondint = int(sys.argv[2])

mysum = firstint + secondint

print('the sum of the two values is {}'.format(mysum))

running the script from the command line (passing 2 arguments)

$ python addtwo.py 5 10
the sum of the two values is 15

exception! running the script from the command line (passing no arguments)

$ python addtwo.py
Traceback (most recent call last):
  File "addtwo.py", line 3, in <module>
    firstint = int(sys.argv[1])
IndexError: list index out of range

The above error occurred because the program asks for items at subscripts sys.argv[1] and sys.argv[2], but because no elements existed at those indices, Python raised an IndexError exception.


Reading from Files

A file can be read as a single string or as a list of strings; each string can then be split into fields.


file (TextIOWrapper) object

# read(): file text as a single string
fh = open('students.txt')  # file object allows reading
text = fh.read()                          # read() method called on
                                          # file object returns a string

fh.close()                                # close the file

print(text)                               # single string, entire text


# readlines():  file text as a list of strings
fh = open('students.txt')
file_lines = fh.readlines()               # file.readlines() returns
                                          # a list of strings

fh.close()                                # close the file

print(file_lines)                         # list of strings,
                                          # each string a line
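
Once we have the lines, each one can be split into fields. A sketch, with inline sample data (and a hypothetical id:username:first:last format) standing in for students.txt:

```python
# sample lines as readlines() might return them (hypothetical format)
file_lines = ['1:jw234:Joe:Wilson\n', '2:ms15:Mary:Smith\n']

for line in file_lines:
    fields = line.rstrip('\n').split(':')   # strip the newline, then split
    print(fields)   # ['1', 'jw234', 'Joe', 'Wilson'],
                    # then ['2', 'ms15', 'Mary', 'Smith']
```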

Writing and Appending to Files

Files can be opened for writing or appending; we use the file object and the file write() method.


fh = open('new_file.txt', 'w')
fh.write("here's a line of text\n")
fh.write('I add the newlines explicitly if I want to write to the file\n')
fh.close()

lines = open('new_file.txt').readlines()
print(lines)
  # ["here's a line of text\n",
  #  'I add the newlines explicitly if I want to write to the file\n']

Note that we are explicitly adding newlines to the end of each line. The write() method doesn't do this for us.


Reading directories with os.listdir()

os.listdir() can read any directory, but the filename must be appended to the directory path in order for Python to find it.


import os                  # os ('operating system') module talks to the os
mydirectory = '/Users/dblaikie'

for item in os.listdir(mydirectory):

    item_path = os.path.join(mydirectory, item)

    print(item_path)                    # /Users/dblaikie/photos/
                                        # /Users/dblaikie/backups/
                                        # /Users/dblaikie/college_letter.docx
                                        # /Users/dblaikie/notes.txt
                                        # /Users/dblaikie/finances.xlsx

Here we see all the files in my home directory on my Mac (/Users/dblaikie). We use os.path.join() to build the full path to each file. os.path.join() is designed to take two or more strings and insert a directory slash between them. It is preferred over regular string joining or concatenation because it is aware of the operating system type and inserts the correct slash (forward slash or backslash) for that operating system.


Reading directory listing type with os.path.isfile() and os.path.isdir()

os.path.isdir() and os.path.isfile() return True or False depending on whether a listing is a directory or a file, respectively.


import os                         # os ('operating system') module talks
                                  # to the os (for file access & more)
mydirectory = '/Users/dblaikie'

for item in os.listdir(mydirectory):

    item_path = os.path.join(mydirectory, item)

    if os.path.isdir(item_path):
        print("{}:  {}".format(item, 'directory'))
    elif os.path.isfile(item_path):
        print("{}:  {}".format(item, 'file'))
                                                 # photos:  directory
                                                 # backups:  directory
                                                 # college_letter.docx:  file
                                                 # notes.txt:  file
                                                 # finances.xlsx:  file

Reading file size with os.path.getsize()

os.path.getsize() takes a filename and returns the size of the file in bytes


import os                        # os ('operating system') module
                                 # talks to the os (for file access & more)
mydirectory = '/Users/dblaikie'

for item in os.listdir(mydirectory):
    item_path = os.path.join(mydirectory, item)
    item_size = os.path.getsize(item_path)
    print("{}:  {} bytes".format(item_path, item_size))

Keep in mind that Python won't be able to find a file unless its path is prepended. This is why os.path.join() is so important.
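
Putting listdir(), join(), isfile() and getsize() together, here is a sketch that finds the biggest file in a directory (the helper name largest_file is hypothetical):

```python
import os

def largest_file(directory):
    """Return (path, size) of the largest regular file in directory."""
    best_path, best_size = None, -1
    for item in os.listdir(directory):
        item_path = os.path.join(directory, item)   # full path, as always
        if os.path.isfile(item_path):
            size = os.path.getsize(item_path)
            if size > best_size:
                best_path, best_size = item_path, size
    return best_path, best_size
```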


Summary exception: OSError / WindowsError with os.listdir() (and a bad directory)

Python will raise an OSError exception if we try to read a directory or file that doesn't exist, or one we don't have permission to read.


import os

# user enters a file that doesn't exist
user_file = input('please enter a filename:  ')

file_size = os.path.getsize(user_file)


Traceback (most recent call last):
    File "getsize.py", line 5, in <module>
OSError:  No such file or directory:  'mispeld.txt'

How to handle this exception? Test to see if the file exists first, or trap the exception.
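
Both remedies can be sketched in one helper (the name safe_getsize is hypothetical):

```python
import os

def safe_getsize(path):
    """Return the file's size in bytes, or None if it can't be read."""
    if not os.path.exists(path):       # remedy 1: test first
        return None
    try:
        return os.path.getsize(path)   # remedy 2: trap the exception anyway
    except OSError:
        return None

print(safe_getsize('mispeld.txt'))     # None (assuming no such file),
                                       # instead of a traceback
```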


Sidebar -- traverse a directory tree with os.walk()

os.walk() visits every directory in a directory tree so we can list files and folders.


import os
root_dir = '/Users'
for root, dirs, files in os.walk(root_dir):   # root string,
                                              # dirs list,
                                              # files list

    for dir in dirs:                    # loop through dirs in this directory

        print(os.path.join(root, dir))  # print full path to dir

    for file in files:                  # loop through files in this dir
        print(os.path.join(root, file)) # print full path to file

os.walk does something magical (and invisible): it traverses the directory tree, descending into each subdirectory in turn, so that every subdirectory beneath the root directory is visited. Each pass of the outer for loop therefore shows us the contents of one particular directory: a fresh list of its subdirectories and a fresh list of its files. We can do what we like with this information until the end of the block; looping back, os.walk moves on to the next directory and the process repeats.


Reading and writing zip files

zip archives are a universal compression format.


Writing to a zip file

from zipfile import ZipFile

myzip = ZipFile('spam.zip', 'w')

myzip.write('eggs.txt')
myzip.write('ham.txt')
myzip.write('bacon.txt')

myzip.close()

Reading from a zip file

myzip = ZipFile('spam.zip')
myfile = myzip.open('eggs.txt')
print(myfile.read())                         # b'I do not like Green Eggs and Ham'
myfile.close()

When reading zipfiles, you may notice that printed text has a b'' in front of the string: this is called a bytestring. A bytestring is undecoded text.


To convert a bytestring to a str, call its decode() method with an encoding:

raw = myfile.read()

string_text = raw.decode()           # decode as utf-8 (the default)

string_text = raw.decode('ascii')
string_text = raw.decode('utf-8')    # default
string_text = raw.decode('utf-16')

This is related to Python's support for unicode. Unicode can be confusing, but a good rule of thumb is to try 'ascii', then 'utf-8', then 'utf-16', and choose the first one that does not raise a UnicodeDecodeError.
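
That rule of thumb can be sketched as a loop (the helper name decode_bytes is hypothetical):

```python
def decode_bytes(raw):
    """Try common encodings in order; return the first successful decode."""
    for encoding in ('ascii', 'utf-8', 'utf-16'):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue                # this encoding failed; try the next
    return None                     # nothing worked; may not be text at all

print(decode_bytes(b'plain ascii'))     # plain ascii
```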




csv module for CSV Files, openpyxl for Excel files and urllib module for accessing files from the web

Importing Python Modules

A module is Python code (a code library) that we can import and use in our own code -- to do specific types of tasks.


The datetime module provides convenient handling of dates.

import datetime           # make datetime (a library module) part of our code

dt = datetime.date.today()           # generate a new date object (dt)
print(dt)                            # prints today's date in YYYY-MM-DD format
dt = dt + datetime.timedelta(days=1)
print(dt)                            # prints tomorrow's date

Once a module is imported, its Python code is made available to our code. We can then call specialized functions and use objects to accomplish specialized tasks. Python's module support is profound and extensive. Modules can do powerful things, like manipulate image or sound files, munge and process huge blocks of data, do statistical modeling and visualization (charts) and much, much, much more. The Python 3 Standard Library documentation can be found at https://docs.python.org/3/library/index.html Python 2 Standard Library: https://docs.python.org/2.7/library/index.html


CSV

The csv module parses CSV files, splitting the lines into fields for us. We iterate over the reader object in much the same way we would a file object.


import csv
fh = open('students.txt', newline='')  # newline='' is recommended for csv
reader = csv.reader(fh)

next(reader)              # skip one row (useful for header lines)

for record in reader:     # loop through each row

    print('id:{};  fname:{}; lname: {}'.format(record[0], record[1],
                                               record[2]))

This module takes into account more advanced CSV formatting, such as quotation marks (which are used to allow commas within data.)
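
A quick demonstration of quoted fields, using in-memory sample data (a sketch; io.StringIO stands in for a real file):

```python
import csv
import io

# one quoted field contains a comma; csv keeps it as a single field
data = io.StringIO('id,name,notes\n1,Joe,"likes commas, clearly"\n')
reader = csv.reader(data)

next(reader)                 # skip the header row
for record in reader:
    print(record)            # ['1', 'Joe', 'likes commas, clearly']
```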


CSV files that come from Excel use Windows-format newlines (\r\n), which can confuse the csv reader; passing newline='' to open(), as the csv module documentation recommends, avoids the problem.


Writing is similarly easy:

import csv
wfh = open('some.csv', 'w', newline='')
writer = csv.writer(wfh)
writer.writerow(['some', 'values', "boy, don't you like long field values?"])
writer.writerows([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])
wfh.close()

Access Excel spreadsheets

Use the openpyxl module to read and write Excel


import openpyxl as ox


# read an Excel workbook from a file
wb = ox.load_workbook('revenue.xlsx')


# show list of sheets within the workbook
print(wb.sheetnames)
  # ['transactions']


# access a single sheet within the workbook
ws = wb['transactions']


# access one cell by Excel cellname
c = ws['A2']           # 'cell' object
print(c.value)          # the value in the cell

print(ws['B2'].value)   # access value in one statement

# set a value in the cell
ws['C2'] = 3


# loop through the entire worksheet and look at each row
for row in ws:

    # create a list of values from the row
    rowvals = [ c.value for c in row ]
    print(rowvals)

You can also write workbooks, change Excel formatting, and much more. See docs at https://openpyxl.readthedocs.io/en/stable/usage.html Note that another very convenient way to access Excel is through the pandas module, which we'll review later in the course.


Python as a web client: the urllib module

A Python program can take the place of a browser, requesting and downloading CSV, HTML pages and other files.


Your Python program can work like a web spider (for example, visiting every page on a website looking for particular data or compiling data from the site), can visit a page repeatedly to see if it has changed, can visit a page once a day to compile information for that day, etc. urllib is a full-featured module for making web requests. Although the requests module is strongly favored by some for its simplicity, it is not part of the Python standard library.


The urlopen method takes a url and returns a file-like object that can be read() as a file:

import urllib.request
my_url = 'http://www.google.com'
readobj = urllib.request.urlopen(my_url)  # return a 'file-like' object
text = readobj.read()                     # read into a 'byte string'
# text = text.decode('utf-8')             # optional, sometimes required:
                                          # decode as a 'str' (see below)
readobj.close()

Alternatively, you can call readlines() on the object (many objects that deliver file-like string output can be read with this same-named method):

for line in readobj.readlines():
  print(line)
readobj.close()

POTENTIAL ERRORS AND REMEDIES WITH urllib


TypeError mentioning 'bytes' -- sample exception messages:

TypeError: can't use a string pattern on a bytes-like object
TypeError: must be str, not bytes
TypeError: can't concat bytes to str

These errors indicate that you tried to use a byte string where a str is appropriate.


The urlopen() response usually comes to us as a special object called a byte string. In order to work with the response as a string, we can use the decode() method to convert it into a string with an encoding.

text = text.decode('utf-8')

'utf-8' is the most common encoding, although others ('ascii', 'utf-16', 'utf-32' and more) may be required.


I have found that we do not always need to decode (depending on what you will be doing with the returned value), which is why the line is commented out in the first example.

SSL Certificate Error

Many websites enable SSL security and require a web request to accept and validate an SSL certificate (certifying the identity of the server). urllib requires SSL certificate verification by default, but it can be bypassed (keep in mind that this may be a security risk).


import ssl
import urllib.request

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

my_url = 'http://www.nytimes.com'
readobj = urllib.request.urlopen(my_url, context=ctx)

Encoding Parameters: urllib.parse.urlencode()

When including parameters in our requests, we must encode them into our request URL. The urlencode() function does this nicely:


import urllib.request, urllib.parse

params = urllib.parse.urlencode({'choice1': 'spam and eggs',
                                 'choice2': 'spam, spam, bacon and spam'})
print("encoded query string: ", params)
f = urllib.request.urlopen("http://www.google.com?{}".format(params))
print(f.read())

this prints:

encoded query string:
choice1=spam+and+eggs&choice2=spam%2C+spam%2C+bacon+and+spam

choice1:  spam and eggs<BR>
choice2:  spam, spam, bacon and spam<BR>



Relational Databases and SQL Part 1

Starting the SQLite3 Client and Opening a Database Archive

Most databases come with a tool for issuing SQL queries from a command prompt.


Special Note for Windows Users: the Sqlite3 client must be installed.

  1. You can find sqlite3.exe in the session_1_working_files/ folder under source data linked from the home page.
  2. The simplest way to begin is to place this file in the same directory as the one in which you are working.
  3. For a permanent install, place the file within your Windows system's %PATH%; see me for more information


You can start the SQLite3 client in one of these ways:

  1. Open a new database archive: sqlite3 new.db
  2. Open an existing database archive: sqlite3 session_1.db
  3. Open an "in-memory" database (which can be saved later): sqlite3


Note carefully that the syntax for opening a new file and opening an existing file are the same! This means that if you intend to open an existing file but misspell the name, SQLite3 will simply create a new file; you'll then be confused to see that the file you thought you opened has nothing in it!

Special Note: sqlite3 client column formatting


At the start of your session, issue the following two commands -- these will format your sqlite3 output so it is clearer, and add columns headers.

sqlite> .mode column
sqlite> .headers on

sqlite> SELECT * FROM revenue;

    # company     state       price
    # ----------  ----------  ----------
    # Haddad's    PA          239.5
    # Westfield   NJ          53.9
    # The Store   NJ          211.5
    # Hipster's   NY          11.98
    # Dothraki F  NY          5.98
    # Awful's     PA          23.95
    # The Clothi  NY          115.2

RDBMS (Relational Database Management System) Tables

A relational database stores data in tabular form, similar to CSV, Excel, etc.


A database table is a tabular structure (like CSV) stored in binary form, which can only be displayed through a database client or a database driver (for a language like Python). The database client is provided by the database and allows command-prompt access to the database; every database provides its own client. A database driver is a module that provides programmatic access to the database. Python has a full suite of drivers for most major databases (mariadb, oracle, postgres, etc.).

Database Tables


A mysql database table description looks like this:

+-------+--------+------+-----+---------+-------+
| Field | Type   | Null | Key | Default | Extra |
+-------+--------+------+-----+---------+-------+
| date  | int(8) | YES  |     | NULL    |       |
| mktrf | float  | YES  |     | NULL    |       |
| hml   | float  | YES  |     | NULL    |       |
| smb   | float  | YES  |     | NULL    |       |
| rf    | float  | YES  |     | NULL    |       |
+-------+--------+------+-----+---------+-------+

"Field" is the column name. "Type" specifies the required data type for that column.


In sqlite3 the database table description looks like this:

CREATE TABLE ff_table (date INTEGER, mktrf FLOAT, hml FLOAT, smb FLOAT, rf FLOAT)

The available database column types are defined by the specific database. Most are very similar, with small variations between databases.


SQLite3 (file-based) and mariadb/mysql (traditional db) RDBMSs

We might use SQLite3 to prototype our tables, and mariadb/mysql for production.


mariadb (formerly known as mysql; the two names are often used interchangeably) is a production-quality RDBMS that is free and fully embraced by the industry. It supports most SQL statements and is very similar to Oracle, postgres, SQL Server, etc. SQLite is a file-based RDBMS that is extremely lightweight and requires no installation. It also supports many SQL statements, although its data types are more limited. It is claimed to be the most-used database on the planet because it can work in very small environments (for example, "internet of things" devices). It is also ideal for learning. If you are new to databases, you are recommended to use SQLite for your coursework here. If you wish to try mysql/mariadb, you must install it and you'll need to do some research to complete the homework.

Basic Commands for SQLite3 and mysql/MariaDB

(for each task below: the SQLite command first, then the MariaDB/mysql command)


start the command-prompt utility

$ sqlite3


 

$ mysql


show existing databases

(in SQLite, each file is a database,
so opening a file (below) provides
this connection)


 

Maria DB []> show databases;


connect to a database

sqlite> .open sqlite3_trades.db


 

Maria DB []> use trades;


show tables in the database

sqlite> .tables


 

Maria DB [trades]> show tables;


describe a table

sqlite> .schema stocks


 

Maria DB [trades]> desc stocks;


select specified columns from the table

sqlite> SELECT date, trans_type,
        symbol, qty FROM stocks;


 

Maria DB [trades]> SELECT date, trans_type,
                   symbol, qty FROM stocks;


select ALL columns from the table

sqlite> SELECT * FROM stocks;


 

Maria DB [trades]> SELECT * FROM stocks;


select only rows WHERE a value is found

sqlite> SELECT date, symbol, qty
        FROM stocks
        WHERE trans_type = 'BUY';


 

Maria DB [trades]> SELECT date, symbol, qty
                   FROM stocks
                   WHERE trans_type = 'BUY';


INSERT INTO: insert a row

sqlite> INSERT INTO stocks
        (date, trans_type, symbol, qty)
        VALUES ('2015-01-01', 'BUY', 'MSFT', 10);


 

Maria DB [trades]> INSERT INTO stocks
                   (date, trans_type, symbol, qty)
                   VALUES ('2015-01-01', 'BUY', 'MSFT', 10);


connect to database from Python

import sqlite3
conn = sqlite3.connect('sqlite3_trades.db')

c = conn.cursor()


 

import pymysql

host = 'localhost'
database = 'test'
port = 3306
username = 'ta'
password = 'pepper'

conn = pymysql.connect(host=host,
                       port=port,
                       user=username,
                       passwd=password,
                       db=database)

cur = conn.cursor()


execute a SELECT query from Python

c.execute('''SELECT date, symbol, qty
             FROM stocks
             WHERE trans_type = 'BUY' ''')


 

cur.execute('''SELECT date, symbol, qty
               FROM stocks
               WHERE trans_type = 'BUY' ''')


retrieve one row or many rows from a result set

rs = c.execute('''SELECT * FROM stocks
                  ORDER BY price''')

tuple_row = rs.fetchone()      # tuple row

tuple_rows = rs.fetchmany(3)   # list of tuple rows - specify # of rows

tuple_rows = rs.fetchall()     # list of tuple rows - entire result set


 

cur.execute('''SELECT * FROM stocks
               ORDER BY price''')

tuple_row = cur.fetchone()     # tuple row

tuple_rows = cur.fetchmany(3)  # list of tuple rows


iterate over database result set

rs = c.execute('''SELECT * FROM stocks
                  ORDER BY price''')

for tuple_row in rs:
    print(tuple_row)


 

for tuple_row in cur:
    print(tuple_row)


insert a row

rs = c.execute("""INSERT INTO stocks
                  (date, trans_type, symbol, qty)
                  VALUES ('2015-01-01', 'BUY', 'MSFT', 10)""")


 

cur.execute("""INSERT INTO stocks
               (date, trans_type, symbol, qty)
               VALUES ('2015-01-01', 'BUY', 'MSFT', 10)""")
conn.commit()          # pymysql does not autocommit by default


list common commands

.help


 

(see the MySQL documentation for its list of common commands)




sqlite3 from Python

The sqlite3 module provides programmatic access to sqlite3 databases.


Keep in mind that the interface you use for SQLite3 will be very similar to one that you use for other databases such as mysql/MariaDB.


import sqlite3
conn = sqlite3.connect('example.db')  # a db connection object

c = conn.cursor()                     # a cursor object for issuing queries

Once a cursor object is established, SQL can be used to write to or read from the database:

c.execute('''CREATE TABLE stocks
             (date text, trans text, symbol text, qty real, price real)''')

Note that sqlite3 datatypes are nonstandard and don't exactly match types found in databases such as mysql/MariaDB:

INTEGER    all int types (TINYINT, BIGINT, INT, etc.)
REAL       FLOAT, DOUBLE, REAL, etc.
NUMERIC    DECIMAL, BOOLEAN, DATE, DATETIME, NUMERIC
TEXT       CHAR, VARCHAR, etc.
BLOB       BLOB (non-typed (binary) data, usually large)


Retrieve one row of data: .fetchone()

rs = c.execute("SELECT * FROM revenue WHERE company  = 'The Store'")

row = rs.fetchone()

print(row)
### ('The Store', 'NJ', 211.5)   # one tuple row

Retrieve all rows of a result set: .fetchall()

rs = c.execute("SELECT * FROM stocks WHERE symbol = 'RHAT'")

rows = rs.fetchall()

print(rows)

### [('2006-01-05', 'BUY', 'RHAT', 100, 35.14),   # a list of tuples
###  ('2006-01-05', 'BUY', 'RHAT', 100, 45.08)]

Retrieve some rows of a result set: .fetchmany()

rs = c.execute('SELECT * FROM stocks')

rows = rs.fetchmany(3)
### as above - a list of tuples

with this approach, we can control how much data is held in memory at a time
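
Batched fetching can be sketched end to end with a throwaway in-memory database (the table and sample rows here are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(':memory:')          # in-memory database for the demo
c = conn.cursor()
c.execute('CREATE TABLE stocks (symbol TEXT, qty REAL)')
c.executemany('INSERT INTO stocks VALUES (?, ?)',
              [('IBM', 100), ('MSFT', 50), ('RHAT', 25), ('AAPL', 10)])

rs = c.execute('SELECT * FROM stocks')
while True:
    batch = rs.fetchmany(3)     # at most 3 rows held in memory at a time
    if not batch:               # an empty list means the result set is done
        break
    print(batch)

conn.close()
```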


Loop through multiple rows of data: iteration over the result set

rs = c.execute('SELECT * FROM stocks')
for tuple_row in rs:
    print(tuple_row)

### ('2006-01-05', 'BUY', 'RHAT', 100, 35.14)
### ('2006-03-28', 'BUY', 'IBM', 1000, 45.0)
### ('2006-04-06', 'SELL', 'IBM', 500, 53.0)
### ('2006-04-05', 'BUY', 'MSFT', 1000, 72.0)

this also loads only one row of data into memory at a time


Close the database

conn.close()



Sequence Manipulation

Identifying Sequences and Iterables

A sequence is an iterable container of items; an iterable is anything that can be looped through


So far we have explored the reading and parsing of data; the loading of data into built-in structures; and the aggregation and sorting of these structures. This session explores advanced tools for container processing.


set operations

a = set(['a', 'b', 'c'])
b = set(['b', 'c', 'd'])
print(a.difference(b))      # {'a'}
print(a.union(b))           # {'a', 'b', 'c', 'd'}
print(a.intersection(b))    # {'b', 'c'}

list comprehensions

a = ['hello', 'there', 'harry']
print([ var.upper() for var in a if var.startswith('h') ])
                           # ['HELLO', 'HARRY']

lambda functions

names = ['Joe Wilson', 'Pete Johnson', 'Mary Rowe']
sorted_names = sorted(names, key=lambda x: x.split()[1])
print(sorted_names)             # ['Pete Johnson', 'Mary Rowe', 'Joe Wilson']

ternary assignment

rev_sort = True if user_input == 'highest' else False

pos_val = x if x >= 0 else x * -1

conditional assignment

val = this or that       # 'this' if this is truthy, else 'that'
val = this and that      # 'that' if this is truthy, else 'this'
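
A quick demonstration - note that or and and return one of their operands, not True or False:

```python
name = '' or 'anonymous'      # '' is falsy, so the second operand is returned
print(name)                   # anonymous

first = 'hello' or 'world'    # 'hello' is truthy, so it is returned immediately
print(first)                  # hello

both = 'hello' and 'world'    # 'hello' is truthy, so the second operand is returned
print(both)                   # world

none_yet = 0 and 'world'      # 0 is falsy, so it is returned without going further
print(none_yet)               # 0
```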

Container processing: Set Comparisons

We have used the set to create a unique collection of objects. The set also allows comparisons of sets of objects. Methods like set.union (complete member list of two or more sets), set.difference (elements found in this set not found in another set) and set.intersection (elements common to both sets) are fast and simple to use.


set_a = set([1, 2, 3, 4])
set_b = set([3, 4, 5, 6])

print(set_a.union(set_b))           # {1, 2, 3, 4, 5, 6}  (set_a + set_b)
print(set_a.difference(set_b))      # {1, 2}              (set_a - set_b)
print(set_a.intersection(set_b))    # {3, 4}  (what is common between them?)
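
The same comparisons can also be written with set operators, which are equivalent to the methods above:

```python
set_a = {1, 2, 3, 4}
set_b = {3, 4, 5, 6}

print(set_a | set_b)    # union:        {1, 2, 3, 4, 5, 6}
print(set_a - set_b)    # difference:   {1, 2}
print(set_a & set_b)    # intersection: {3, 4}
print(set_a ^ set_b)    # symmetric difference (in one set but not both): {1, 2, 5, 6}
```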

List comprehensions: filtering a container's elements

List comprehensions abbreviate simple loops into one line.


Consider this loop, which filters a list so that it contains only positive integer values:

myints = [0, -1, -5, 7, -33, 18, 19, 55, -100]
myposints = []
for el in myints:
  if el > 0:
    myposints.append(el)

print(myposints)                   # [7, 18, 19, 55]

This loop can be replaced with the following one-liner:

myposints = [ el for el in myints if el > 0 ]

See how the looping and test in the first loop are distilled into the one line? The first el is the element that will be added to myposints - list comprehensions automatically build new lists and return them when the looping is done.


The operation is the same, but the order of operations in the syntax is different:

# this is pseudo code
# target list = item for item in source list if test

Hmm, this makes a list comprehension less intuitive than a loop. However, once you learn how to read them, list comprehensions can actually be easier and quicker to read - primarily because they are on one line. This is an example of a filtering list comprehension - it allows some, but not all, elements through to the new list.


List comprehensions: transforming a container's elements

Consider this loop, which doubles each value in a list:


nums = [1, 2, 3, 4, 5]
dblnums = []
for val in nums:
  dblnums.append(val*2)

print(dblnums)                          # [2, 4, 6, 8, 10]

This loop can be distilled into a list comprehension thusly:

dblnums = [ val * 2 for val in nums ]

This transforming list comprehension transforms each value in the source list before sending it to the target list:

# this is pseudo code
# target list = item transform for item in source list

We can of course combine filtering and transforming:

vals = [0, -1, -5, 7, -33, 18, 19, 55, -100]
doubled_pos_vals = [ i*2 for i in vals if i > 0 ]
print(doubled_pos_vals)                # [14, 36, 38, 110]

List comprehensions: examples

If they only replace simple loops that we already know how to do, why do we need list comprehensions? As mentioned, once you are comfortable with them, list comprehensions are much easier to read and comprehend than traditional loops. They say in one statement what loops need several statements to say - and reading multiple lines certainly takes more time and focus to understand.


Some common operations can also be accomplished in a single line. In this example, we produce a list of lines from a file, stripped of whitespace:

stripped_lines = [ i.rstrip() for i in open('FF_daily.txt').readlines() ]

Here, we're only interested in lines of a file that begin with the desired year (1972):

totals = [ i for i in open('FF_daily.txt').readlines() if i.startswith('1972') ]

If we want the MktRF values for our desired year, we could gather the bare amounts this way:

mktrf_vals = [ float(i.split()[1]) for i in open('FF_daily.txt').readlines() if i.startswith('1972') ]

And in fact we can do part of an earlier assignment in one line -- the sum of MktRF values for a year:

mktrf_sum = sum([ float(i.split()[1]) for i in open('FF_daily.txt').readlines() if i.startswith('1972') ])

From experience I can tell you that familiarity with these forms makes it very easy to construct them and to decode them very quickly - much more quickly than a 4-6 line loop.


List Comprehensions with Dictionaries

Remember that dictionaries can be expressed as a list of 2-element tuples, converted using items(). Such a list of 2-element tuples can be converted back to a dictionary with dict():


mydict =  {'a': 5, 'b': 0, 'c': -3, 'd': 2, 'e': 1, 'f': 4}

my_items = list(mydict.items())      # my_items is now [('a',5), ('b',0), ('c',-3), ('d',2), ('e',1), ('f',4)]
mydict2 = dict(my_items)       # mydict2 is now   {'a':5,   'b':0,   'c':-3,   'd':2,   'e':1,   'f':4}

It becomes very easy to filter or transform a dictionary using this structure. Here, we're filtering a dictionary by value - accepting only those pairs whose value is larger than 0:

mydict = {'a': 5, 'b': 0, 'c': -3, 'd': 2, 'e': -22, 'f': 4}
filtered_dict = dict([ (i, j) for (i, j) in list(mydict.items()) if j > 0 ])
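
Python 3 also offers the dict comprehension, which builds the dictionary directly and makes the dict() and list() calls unnecessary - equivalent to the filter above:

```python
mydict = {'a': 5, 'b': 0, 'c': -3, 'd': 2, 'e': -22, 'f': 4}

# filter by value, building the new dict directly
filtered_dict = {key: val for key, val in mydict.items() if val > 0}
print(filtered_dict)        # {'a': 5, 'd': 2, 'f': 4}

# swap keys and values (values must be unique and hashable)
swapped = {val: key for key, val in mydict.items()}
print(swapped)              # {5: 'a', 0: 'b', -3: 'c', 2: 'd', -22: 'e', 4: 'f'}
```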

Here we're switching the keys and values in a dictionary, and assigning the resulting dict back to mydict, thus seeming to change it in-place:

mydict = dict([ (j, i) for (i, j) in list(mydict.items()) ])

The Python database module returns database results as tuples. Here we're pulling two of three values returned from each row and folding them into a dictionary.

# 'tuple_db_results' simulates what a database returns
tuple_db_results = [
  ('joe', 22, 'clerk'),
  ('pete', 34, 'salesman'),
  ('mary', 25, 'manager'),
]

names_jobs = dict([ (name, role) for name, age, role in tuple_db_results ])

Sorting Multidimensional Structures

Having built multidimensional structures in various configurations, we should now learn how to sort them -- for example, to sort the keys in a dictionary of dictionaries by one of the values in the inner dictionary (in this instance, the last name):


def by_last_name(key):
    return dod[key]['lname']

dod = {
         'db13':  {
                     'fname': 'Joe',
                     'lname': 'Wilson',
                     'tel':   '9172399895'
                  },
         'mm23':  {
                     'fname': 'Mary',
                     'lname': 'Doodle',
                     'tel':   '2122382923'
                  }
       }

sorted_keys = sorted(dod, key=by_last_name)
print(sorted_keys)                             # ['mm23', 'db13']

The trick here will be to put together what we know about obtaining the value from an inner structure with what we have learned about custom sorting.


Sorting review

A quick review of sorting: recall how Python will perform a default sort (numeric or ASCII-betical) depending on the objects sorted. If we wish to modify this behavior, we can pass each element to a function named by the key= parameter:


mylist = ['Alpha', 'Gamma', 'epsilon', 'beta', 'Delta']

print(sorted(mylist))                      # ASCIIbetical sort
                                          # ['Alpha', 'Delta', 'Gamma', 'beta', 'epsilon']

mylist.sort()                             # sort mylist in-place

print(sorted(mylist, key=str.lower))       # alphabetical sort
                                          # (lowercasing each item by telling Python to pass it
                                          # to str.lower)
                                          # ['Alpha', 'beta', 'Delta', 'epsilon', 'Gamma']

print(sorted(mylist, key=len))             # sort by length
                                          # ['beta', 'Alpha', 'Delta', 'Gamma', 'epsilon']

Sorting review: sorting dictionary keys by value: dict.get

When we loop through a dict, we can loop through a list of keys (and use the keys to get values) or loop through items, a list of (key, value) tuple pairs. When sorting a dictionary by the values in it, we can also choose to sort keys or items.


To sort keys, mydict.get is called with each key - and get returns the associated value. So the keys of the dictionary are sorted by their values.

mydict = { 'a': 5, 'b': 2, 'c': 1, 'z': 0 }
mydict_sorted_keys = sorted(mydict, key=mydict.get)
for i in mydict_sorted_keys:
    print("{0} = {1}".format(i, mydict[i]))

                 ## z = 0
                 ## c = 1
                 ## b = 2
                 ## a = 5

Sorting dictionary items by value: operator.itemgetter

Recall that we can render a dictionary as a list of tuples with the dict.items() method:


mydict = { 'a': 5, 'b': 2, 'c': 1, 'z': 0 }
mydict_items = list(mydict.items())                        # [('a', 5), ('b', 2), ('c', 1), ('z', 0)]

To sort dictionary items by value, we need to sort each two-element tuple by its second element. The built-in module operator.itemgetter will return whatever element of a sequence we wish - in this way it is like a subscript, but in function format (so it can be called by the Python sorting algorithm).

import operator
mydict = { 'a': 5, 'b': 2, 'c': 1, 'z': 0 }
mydict_items = list(mydict.items())                        # [('a', 5), ('b', 2), ('c', 1), ('z', 0)]
mydict_items.sort(key=operator.itemgetter(1))
print(mydict_items)                                   # [('z', 0), ('c', 1), ('b', 2), ('a', 5)]
for key, val in mydict_items:
    print("{0} = {1}".format(key, val))

                    ## z = 0
                    ## c = 1
                    ## b = 2
                    ## a = 5

The above can be conveniently combined with looping, effectively allowing us to loop through a "sorted" dict:

for key, val in sorted(list(mydict.items()), key=operator.itemgetter(1)):
    print("{0} = {1}".format(key, val))

Database results come as a list of tuples. Perhaps we want our results sorted in different ways, so we can store as a list of tuples and sort using operator.itemgetter. This example sorts by the third field, then by the second field (last name, then first name):

import operator
items =[ (123, 'Joe', 'Wilson', 35, 'mechanic'),
         (124, 'Sam', 'Jones', 22, 'mechanic'),
         (125, 'Pete', 'Jones', 40, 'mechanic'),
         (126, 'Irina', 'Bibi', 31, 'mechanic'),
       ]
items.sort(key=operator.itemgetter(2,1)) # sorts by last, first name
for this_pair in items:
  print("{0} {1}".format(this_pair[1], this_pair[2]))

       ## Irina Bibi
       ## Pete Jones
       ## Sam Jones
       ## Joe Wilson

Multi-dimensional structures: sorting with custom function

Similar to itemgetter, we may want to sort a complex structure by some inner value - in the case of itemgetter we sorted a whole tuple by its third value. If we have a list of dicts to sort, we can use a custom function to specify the sort value from inside each dict:


def by_dict_lname(this_dict):
  return this_dict['lname'].lower()

list_of_dicts = [
  { 'id': 123,
    'fname': 'Joe',
    'lname': 'Wilson',
  },
  { 'id': 124,
    'fname': 'Sam',
    'lname': 'Jones',
  },
  { 'id': 125,
    'fname': 'Pete',
    'lname': 'abbott',
  },
]
list_of_dicts.sort(key=by_dict_lname)      # custom sort function (above)
for this_dict in list_of_dicts:
  print("{0} {1}".format(this_dict['fname'], this_dict['lname']))

# Pete abbott
# Sam Jones
# Joe Wilson

So, although we are sorting dicts, our function says "take this dictionary and sort by this inner element of the dictionary".


Multi-dimensional structures: sorting with lambda custom function

Functions are useful but they require that we declare them separately, elsewhere in our code. A lambda is a function in a single statement, and can be placed in data structures or passed as arguments in function calls. The advantage here is that our function is used exactly where it is defined, and we don't have to maintain separate statements.


A common use of lambda is in sorting. The format for lambdas is lambda arg: return_val. Compare each pair of regular function and lambda, and note the argument and return val in each.


def by_lastname(name):
  fname, lname = name.split()
  return lname

names = [ 'Josh Peschko', 'Gabriel Feghali', 'Billy Woods', 'Arthur Fischer-Zernin' ]
sortednames = sorted(names, key=lambda name:  name.split()[1])


list_of_dicts = [
  { 'id': 123,
    'fname': 'Joe',
    'lname': 'Wilson',
  },
  { 'id': 124,
    'fname': 'Sam',
    'lname': 'Jones',
  },
  { 'id': 125,
    'fname': 'Pete',
    'lname': 'abbott',
  },
]

def by_dict_lname(this_dict):
  return this_dict['lname'].lower()

sorted_dicts = sorted(list_of_dicts, key=lambda this_dict:  this_dict['lname'].lower())

In each, the label after lambda is the argument, and the expression that follows the colon is the return value. So in the first example, the lambda argument is name, and the lambda returns name.split()[1]. See how it behaves exactly like the regular function itself? Again, what is the advantage of lambdas? They allow us to design our own functions which can be placed inline, where a named function would go. This is a convenience, not a necessity. But they are in common use, so they must be understood by any serious programmer.


Lambda expressions: breaking them down

Many people have complained that lambdas are hard to grok (absorb), but they're really very simple - they're just so short they're hard to read. Compare these two functions, both of which add/concatenate their arguments:


def addthese(x, y):
  return x + y

addthese2 = lambda x, y:  x + y

print(addthese(5, 9))        # 14
print(addthese2(5, 9))       # 14

The function definition and the lambda statement are equivalent - they both produce a function with the same functionality.


Lambda expression example: dict.get and operator.itemgetter

Here are our standard methods to sort a dictionary:


import operator
mydict = { 'a': 5, 'b': 2, 'c': 1, 'z': 0 }
for key, val in sorted(list(mydict.items()), key=operator.itemgetter(1)):
    print("{0} = {1}".format(key, val))

for key in sorted(mydict, key=mydict.get):
    print("{0} = {1}".format(key, mydict[key]))

Imagine we didn't have access to dict.get and operator.itemgetter. What could we do?

mydict = { 'a': 5, 'b': 2, 'c': 1, 'z': 0 }
for key, val in sorted(list(mydict.items()), key=lambda keyval:  keyval[1]):
    print("{0} = {1}".format(key, val))

for key in sorted(mydict, key=lambda key:  mydict[key]):
    print("{0} = {1}".format(key, mydict[key]))

These lambdas do exactly what their built-in counterparts do: in the case of operator.itemgetter, take a 2-element tuple as an argument and return the 2nd element; in the case of dict.get, take a key and return the associated value from the dict.




User-Defined Functions and Code Organization

User-Defined Functions are named code blocks

The function block is executed every time the function's name is called.


def print_hello():
    print("Hello, World!")

print_hello()             # prints 'Hello, World!'
print_hello()             # prints 'Hello, World!'
print_hello()             # prints 'Hello, World!'

When we run this program, we see the greeting printed three times. One advantage of this is that when we want the greeting printed, we can call the function by name instead of actually printing the statement ourselves. Just as importantly, if we wanted to change our greeting so that it said "Hello, Earth!" instead, we would just change it in the function - we wouldn't have to change it three times.


Function Argument(s)

Any argument(s) passed to a function are aliased to variable names inside the function definition.


def print_hello(greeting, person):    # 2 strings aliased to objects
                                      # passed in the call

    full_greeting = "{}, {}!".format(greeting, person)
    print(full_greeting)

print_hello('Hello', 'World')         # pass 2 strings:  prints "Hello, World!"
print_hello('Bonjour', 'Python')      # pass 2 strings:  prints "Bonjour, Python!"
print_hello('squawk', 'parrot')       # pass 2 strings:  prints "squawk, parrot!"

The return Statement Returns a Value

Object(s) are returned from a function using the return statement.


def print_hello(greeting, person):
    full_greeting = greeting + ", " + person + "!"
    return full_greeting

msg = print_hello('Bonjour', 'parrot')   # full_greeting
                                         # aliased to msg

print(msg)                                # 'Bonjour, parrot!'
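
In fact a function "returning multiple values" is returning a single tuple, which the caller can unpack - a sketch:

```python
def min_max(values):
    return min(values), max(values)    # packed into one tuple

result = min_max([3, 1, 9, 4])
print(result)            # (1, 9)

low, high = min_max([3, 1, 9, 4])      # unpacked into two names
print(low, high)         # 1 9
```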

Dividing a Program into Steps

We should plan our code design around several discrete operations. Each of these will translate to a function.


take input
Input may come from the user's keyboard or program arguments (discussed in a later unit).

validate input (isdigit(), valid choice, etc.)
Since the user may provide "bad" or incorrect input, we must test the input and return an error message if invalid.

normalize input (str->float, etc.)
The user's input may be correct, but not in a form useful to us -- he or she may be asked to input numbers, but those numbers will come to us as strings. In this step we convert input to a usable form.

read data; normalize data
Data may come from a file, database or other networked resource, and it is usually not in the form in which we'd like to use it. In these steps we read the data and then transform / convert it into a usable form. This may involve selecting portions of the data and adding to a container structure (coming soon).

perform calculation(s)
With the data in our preferred form, we can now perform calculations to produce a desired result -- adding numbers together, examining or transforming strings, etc.

report result
Finally, we need to output the result to screen, a file, a web page, etc. This is the step in which the program's results are displayed to the user or written to a file or database.


Rendering Coding Steps as Functions

Coding steps can be rendered as functions; the "main body" code controls these functions


In this solution to the tip calculator, we have marked each step in the code with an ALL CAPS comment:

# TAKE INPUT
bill_amt = input('Please enter the total bill amount:  ')
party_size =  input('Please enter the number in your party:  ')
tip_pct = input('Please enter the desired tip percentage (for example, "20" for 20%):  ')

# NORMALIZE INPUT
bill_amt = float(bill_amt)
party_size = int(party_size)
tip_pct = float(tip_pct)

# PERFORM CALCULATIONS
tip_amt = bill_amt * tip_pct * .01
total_bill = bill_amt + tip_amt
person_share = total_bill / party_size

#REPORT RESULT
print('A ' + str(tip_pct) + '% tip ($' + str(tip_amt) + ') was added to the bill, for a total of $' + str(total_bill))
print('With ' + str(party_size) + ' in your party, each person must pay $' + str(person_share))

(There is no read/normalize data step in this example.)


Overall Program Structure

Here is the same solution organized into functions, along with function comments:


# CONSTANT values can be used inside functions
TOTAL_QUERY =      'Please enter the total bill amount:  '
PARTY_SIZE_QUERY = 'Please enter the number in your party:  '
TIP_PCT_QUERY =    'Please enter the desired tip percentage (for example, "20" for 20%):  '

MSG1 = 'A {}% tip (${}) was added to the bill, for a total of ${}.'
MSG2 = 'With {} in your party, each person must pay ${}.'


def take_input():
    """ take keyboard input """

    bill_amt = input(TOTAL_QUERY)
    party_size =  input(PARTY_SIZE_QUERY)
    tip_pct = input(TIP_PCT_QUERY)

    return bill_amt, party_size, tip_pct


def normalize_input(bill_amt, party_size, tip_pct):
    """ convert user inputs to float and int """

    bill_amt = float(bill_amt)
    party_size = int(party_size)
    tip_pct = float(tip_pct)

    return bill_amt, party_size, tip_pct


def perform_calculations(bill_amt, party_size, tip_pct):
    """ calculate tip amount, total bill and person's share """

    tip_amt = bill_amt * tip_pct * .01
    total_bill = bill_amt + tip_amt
    person_share = total_bill / party_size

    return tip_amt, total_bill, person_share

def main():
    bill, size, pct = take_input()

    bill_num, size_num, pct_num = normalize_input(bill, size, pct)

    tip_amt, total_bill, person_share = perform_calculations(bill_num, size_num, pct_num)

    print(MSG1.format(pct, tip_amt, total_bill))
    print(MSG2.format(size, person_share))

if __name__ == '__main__':            # 'main body' code (not in a function)
    main()

* The CONSTANT VALUES are values that you know will not change during program execution, but might be changed by the programmer at some point in the future. In the above case, we are showing display text. We set these at the top for reference, because they aren't really part of the logic of the code.
* The functions make up the main actions that our code takes. Most of the code will be placed within functions.
* The if __name__ == '__main__': block is a special if test that holds the "main body" code. This is necessary so that a testing program can call this code's functions without running the rest of the code.


"Pure" functions

Functions should only use argument values. They may sometimes use "constant" values defined at the top of the script. They must not refer to global variables defined elsewhere in the "main body" code.


arguments are variables passed to functions
local variables are those defined / initialized inside a function
global variables are those defined / initialized outside of a function
constant variables are those defined in ALL_CAPS at the top of the script. They are not expected to be changed or reassigned anywhere in the script.


The name of this program is doubler.py (caps in comments are for emphasis)

PI = 3.14                 # PI is a CONSTANT


def take_input():
    multi = input('please enter a multiplier: ')
    imulti = int(multi)
    return imulti

def multipi(val):          # val is an ARGUMENT
    dblval = PI * val      # dblval is a LOCAL variable
    return dblval


if __name__ == '__main__':
    uval = take_input()
    dval = multipi(uval)   # dval is a GLOBAL variable
    print('PI * {} = {}'.format(uval, dval))

Keep in mind that we could easily have used the global uval directly inside multipi() -- instead, we passed it to the function as an argument. The use of a global inside a function is a cardinal transgression! Why? Because allowing this can introduce errors into code; also, it can render the function untestable.
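
To see why, compare a version that reads a global with a pure version (a sketch; the names are illustrative):

```python
PI = 3.14          # a CONSTANT: fine to use inside functions

multiplier = 3     # a global variable

def impure_multipi():
    # BAD: depends on a global defined elsewhere; can only be
    # tested by manipulating the global from outside
    return PI * multiplier

def pure_multipi(val):
    # GOOD: depends only on its argument; testable in isolation
    return PI * val

print(pure_multipi(2))       # 6.28
print(pure_multipi(3) == impure_multipi())    # True
```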




Primary Keys

The Primary Key is the Unique "id" for a Block of Data

A db table has a unique key; a dictionary's keys qualify as primary keys.


SQL


A database table description may include a unique field that often represents an id for a row (although it can be any unique identifier).

+-----------------------+---------------+------+-----+---------+----------------+
| Field                 | Type          | Null | Key | Default | Extra          |
+-----------------------+---------------+------+-----+---------+----------------+
| stu_id                | int(11)       | NO   | PRI | NULL    | auto_increment |
| username              | varchar(50)   | YES  |     | NULL    |                |
| password              | varchar(50)   | YES  |     | NULL    |                |
| first_name            | varchar(50)   | YES  |     | NULL    |                |
| last_name             | varchar(50)   | YES  |     | NULL    |                |
+-----------------------+---------------+------+-----+---------+----------------+

In the above example, the stu_id field is marked as PRI, meaning that it is a unique identifier for the row -- no other row will have the same identifier. (Any PK is determined in the CREATE TABLE statement.)




Primary Key: "built-in" Python

A dict key can be thought of as a primary key.


A dictionary's keys are also unique, thus they are a "natural" primary key.

student_names = { 'dbb212':  'David Blaikie',
                  'mm64':    'Mark Meretzky',
                  'jp29':    'Jerry Pacemaker' }

Since structures can be nested, a dict of dicts or dict of lists can key an id to data of any size or complexity.

students = { 'dbb212':  { 'fname': 'David',
                          'lname': 'Blaikie',
                          'GPA'  : 3.8        },

             'jw23':    { 'fname': 'Joe',
                          'lname': 'Wilson',
                          'GPA':   3.7        }
            }

Think of the above structure as a table row -- a unique identifier (i.e., a primary key) keyed to same-named dictionaries. This dict can also be expressed as a 4-column table (id, fname, lname, GPA).
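
Pulling data out of such a structure is then a matter of chained subscripts - for example:

```python
students = {'dbb212': {'fname': 'David', 'lname': 'Blaikie', 'GPA': 3.8},
            'jw23':   {'fname': 'Joe',   'lname': 'Wilson',  'GPA': 3.7}}

# the primary key selects the inner dict; a second subscript selects a field
print(students['dbb212']['GPA'])     # 3.8

# the same data expressed as 4-column rows (id, fname, lname, GPA)
rows = [(sid, d['fname'], d['lname'], d['GPA']) for sid, d in students.items()]
print(rows[0])                       # ('dbb212', 'David', 'Blaikie', 3.8)
```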






SQL Part 2: Primary Key, JOIN, GROUP BY, ORDER BY

Reminder: sqlite3 client column formatting

Use these sqlite3 commands to format your output readably.


At the start of your session, issue the following two commands -- these will format your sqlite3 output so it is clearer, and add columns headers.

sqlite> .mode column
sqlite> .headers on

Now output is clearly lined up with column heads displayed:

sqlite> SELECT * FROM revenue;

    # company     state       price
    # ----------  ----------  ----------
    # Haddad's    PA          239.5
    # Westfield   NJ          53.9
    # The Store   NJ          211.5
    # Hipster's   NY          11.98
    # Dothraki F  NY          5.98
    # Awful's     PA          23.95
    # The Clothi  NY          115.2

However, note that each column is only 10 characters wide. It is possible to change these widths, although it's not usually necessary.


Primary Key (PK) in a Table

The PK is defined in the CREATE TABLE definition.


Here's a table description in SQLite for a table that has a "instructor_id" primary key:

CREATE TABLE instructors ( instructor_id INT PRIMARY KEY,
                           password TEXT,
                           first_name TEXT,
                           last_name TEXT   );

The primary key in a database cannot be duplicated -- an error will occur if this is attempted.
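We can see the enforcement from Python - inserting a duplicate primary key raises sqlite3.IntegrityError (a sketch using an in-memory database):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE instructors ( instructor_id INT PRIMARY KEY, last_name TEXT )')
c.execute("INSERT INTO instructors VALUES (1, 'Blaikie')")

try:
    c.execute("INSERT INTO instructors VALUES (1, 'Wilson')")   # duplicate PK
except sqlite3.IntegrityError as err:
    print('refused:', err)     # the duplicate key is rejected by the database

conn.close()
```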


JOINing Tables on a Primary Key

Two tables may have info keyed to the same primary key -- these can be joined into one table.


Relational database designs attempt to separate data into individual tables, in order to avoid repetition. For example, consider one table that holds data for instructors at a school (in which one instructor appears per row) and another that holds records of a instructor's teaching a class (in which the same instructor may appear in multiple rows).


Here is a CREATE TABLE description for tables instructors and instructor_classes. instructors contains:

sqlite3> .schema instructors
CREATE TABLE instructors ( instructor_id INT PRIMARY KEY,
                           password TEXT,
                           first_name TEXT,
                           last_name TEXT   );

sqlite3> .schema instructor_classes
CREATE TABLE instructor_classes ( instructor_id INT,
                                  class_name TEXT,
                                  day TEXT );

Select all rows from both tables:

sqlite3> SELECT * from instructors;
instructor_id  password    first_name  last_name
-------------  ----------  ----------  ----------
1              pass1       David       Blaikie
2              pass2       Joe         Wilson
3              xxyx        Jenny       Warner
4              yyyy        Xavier      Yellen

sqlite> SELECT * from instructor_classes;
instructor_id  class_name    day
-------------  ------------  ----------
1              Intro Python  Thursday
1              Advanced Pyt  Monday
2              php           Monday
2              js            Tuesday
3              sql           Wednesday
3              mongodb       Thursday
99             Golang        Saturday

Why is instructor_classes data separated from instructors data? If we combined all of this data into one table, there would be repetition -- we'd see the instructor's name repeated on all the rows that indicate the instructor's class assignments. So it makes sense to separate the data that has a "one-to-one" relationship of instructors to the data for each instructor (as in the instructors table) from the data that has a "many-to-one" relationship of the instructor to the data for each instructor (as in the instructor_classes table). But there are times where we will want to see all of this data shown together in a single result set -- we may see repetition, but we won't be storing repetition. We can create these combined result sets using database joins.


LEFT JOIN

all rows from "left" table, and matching rows in right table


A left join includes primary keys from the "left" table (this means the table mentioned in the FROM statement) and will include only those rows in right table that share those same keys.

sqlite3> SELECT * FROM instructors LEFT JOIN instructor_classes
         on instructors.instructor_id = instructor_classes.instructor_id;

instructor_id  password    first_name  last_name   instructor_id  class_name       day
-------------  ----------  ----------  ----------  -------------  ---------------  ----------
1              pass1       David       Blaikie     1              Advanced Python  Monday
1              pass1       David       Blaikie     1              Intro Python     Thursday
2              pass2       Joe         Wilson      2              js               Tuesday
2              pass2       Joe         Wilson      2              php              Monday
3              xxyx        Jenny       Warner      3              mongodb          Thursday
3              xxyx        Jenny       Warner      3              sql              Wednesday
4              yyyy        Xavier      Yellen

Note the missing data on the right half of the last line. The right table instructor_classes had no data for instructor id 4.
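
The same join can be run from Python with the sqlite3 module - a sketch that rebuilds miniature versions of the two tables in an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE instructors (instructor_id INT PRIMARY KEY, first_name TEXT)')
c.execute('CREATE TABLE instructor_classes (instructor_id INT, class_name TEXT)')
c.executemany('INSERT INTO instructors VALUES (?, ?)',
              [(1, 'David'), (2, 'Joe'), (4, 'Xavier')])
c.executemany('INSERT INTO instructor_classes VALUES (?, ?)',
              [(1, 'Intro Python'), (2, 'php')])

joined = c.execute('''SELECT i.first_name, ic.class_name
                      FROM instructors i LEFT JOIN instructor_classes ic
                        ON i.instructor_id = ic.instructor_id
                      ORDER BY i.instructor_id''').fetchall()
for row in joined:
    print(row)
# ('David', 'Intro Python')
# ('Joe', 'php')
# ('Xavier', None)     <- no matching class row: SQL NULL arrives in Python as None

conn.close()
```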


RIGHT JOIN

all rows from the "right" table, and matching rows in the left table


A right join includes primary keys from the "right" table (this means the table mentioned in the JOIN clause) and will include only those rows in the left table that share the same keys as those in the right.


Unfortunately, SQLite did not support RIGHT JOIN until version 3.39 (although many other databases have long supported it). The workaround is to use a LEFT JOIN and reverse the table names.

sqlite3> SELECT * FROM instructor_classes LEFT JOIN instructors ON instructors.instructor_id = instructor_classes.instructor_id;

instructor_id  class_name    day         instructor_id  password    first_name  last_name
-------------  ------------  ----------  -------------  ----------  ----------  ----------
1              Intro Python  Thursday    1              pass1       David       Blaikie
1              Advanced Pyt  Monday      1              pass1       David       Blaikie
2              php           Monday      2              pass2       Joe         Wilson
2              js            Tuesday     2              pass2       Joe         Wilson
3              sql           Wednesday   3              xxyx        Jenny       Warner
3              mongodb       Thursday    3              xxyx        Jenny       Warner
99             Golang        Saturday

Now only rows that appear in instructor_classes appear in this result, and data not found in instructors is missing: the Golang row's instructor_id (99) matches no instructor, so its instructor columns are empty.


INNER JOIN and OUTER JOIN

Select only PKs common to both tables, or all PKs from both tables


INNER JOIN: rows common to both tables


An inner join includes only those rows that have primary key values that are common to both tables:

sqlite3> SELECT * from instructor_classes INNER JOIN instructors ON instructors.instructor_id = instructor_classes.instructor_id;
instructor_id  class_name    day         instructor_id  password    first_name  last_name
-------------  ------------  ----------  -------------  ----------  ----------  ----------
1              Intro Python  Thursday    1              pass1       David       Blaikie
1              Advanced Pyt  Monday      1              pass1       David       Blaikie
2              php           Monday      2              pass2       Joe         Wilson
2              js            Tuesday     2              pass2       Joe         Wilson
3              sql           Wednesday   3              xxyx        Jenny       Warner
3              mongodb       Thursday    3              xxyx        Jenny       Warner

Rows are joined where both instructors and instructor_classes have data.


OUTER JOIN: all rows from both tables


An outer join includes all rows from both tables, regardless of whether a PK id appears in the other table. Here's what the query would be if sqlite3 supported outer joins:

SELECT * from instructor_classes OUTER JOIN instructors ON instructors.instructor_id = instructor_classes.instructor_id;

Unfortunately, OUTER JOIN is not currently supported in sqlite3. In these cases it's probably best to use another approach, e.g. built-in Python or pandas merge() (to come).
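(Newer SQLite releases, 3.39 and later, do support RIGHT and FULL OUTER JOIN.) On older versions, a classic workaround emulates a full outer join by UNIONing two LEFT JOINs, one in each direction. A sketch with toy data (table and column names follow the slides; the rows are made up):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE instructors (instructor_id INTEGER, first_name TEXT);
    CREATE TABLE instructor_classes (instructor_id INTEGER, class_name TEXT);
    INSERT INTO instructors VALUES (1, 'David'), (5, 'Pat');  -- Pat teaches nothing
    INSERT INTO instructor_classes VALUES (1, 'Intro Python'), (99, 'Golang');
""")

# LEFT JOIN in each direction, then UNION (which removes duplicate rows):
rows = conn.execute("""
    SELECT c.class_name, i.first_name
      FROM instructor_classes c LEFT JOIN instructors i USING (instructor_id)
    UNION
    SELECT c.class_name, i.first_name
      FROM instructors i LEFT JOIN instructor_classes c USING (instructor_id)
    ORDER BY class_name
""").fetchall()

print(rows)    # [(None, 'Pat'), ('Golang', None), ('Instructor-less rows included')]
```

Every row from both tables appears once: Pat (no classes) and Golang (no instructor) each show up with NULL on the unmatched side.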


Aggregating data with GROUP BY

"Aggregation" means counting, summing or otherwise summarizing multiple values based on a common key.


Consider summing up a count of voters by their political affiliation (2m Democrats, 2m Republicans, .3m Independents), a sum of revenue of companies by their sector (manufacturing, services, etc.), or an average GPA by household income. All of these require taking into account the individual values of multiple rows and compiling some sort of summary value based on those values.


Here is a sample that we'll play with:

sqlite3> SELECT date, name, rev FROM companyrev;

date        name         rev
----------  -----------  ----------
2019-01-03  Alpha Corp.  10
2019-01-05  Alpha Corp.  20
2019-01-03  Beta Corp.   5
2019-01-07  Beta Corp.   7
2019-01-09  Beta Corp.   3

If we wish to sum up values by company, we can say it easily:

sqlite3> SELECT name, sum(rev) FROM companyrev GROUP BY name;

name         sum(rev)
-----------  ----------
Alpha Corp.  30
Beta Corp.   15

If we wish to count the number of entries for each company, we can say it just as easily:

sqlite3> SELECT name, count(name) FROM companyrev GROUP BY name;

name         count(name)
-----------  -----------
Alpha Corp.  2
Beta Corp.   3
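The same GROUP BY queries can be run from Python with the sqlite3 module; here is a sketch that builds the sample table in memory:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE companyrev (date TEXT, name TEXT, rev INTEGER);
    INSERT INTO companyrev VALUES
        ('2019-01-03', 'Alpha Corp.', 10), ('2019-01-05', 'Alpha Corp.', 20),
        ('2019-01-03', 'Beta Corp.', 5),   ('2019-01-07', 'Beta Corp.', 7),
        ('2019-01-09', 'Beta Corp.', 3);
""")

# one summary row per company: the sum of rev, and the count of rows
sums = conn.execute(
    "SELECT name, SUM(rev) FROM companyrev GROUP BY name ORDER BY name").fetchall()
counts = conn.execute(
    "SELECT name, COUNT(*) FROM companyrev GROUP BY name ORDER BY name").fetchall()

print(sums)      # [('Alpha Corp.', 30), ('Beta Corp.', 15)]
print(counts)    # [('Alpha Corp.', 2), ('Beta Corp.', 3)]
```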

Sorting a Result Set with ORDER BY

This is SQL's way of sorting results.


The ORDER BY clause indicates a single column, or multiple columns, by which we should order our results:

sqlite3> SELECT name, rev FROM companyrev ORDER BY rev;

name        rev
----------  ----------
Beta Corp.  3
Beta Corp.  5
Beta Corp.  7
Alpha Corp  10
Alpha Corp  20
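ORDER BY can also take multiple columns, and DESC reverses the order of any column. A sketch with the sample data rebuilt in memory via sqlite3:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE companyrev (date TEXT, name TEXT, rev INTEGER);
    INSERT INTO companyrev VALUES
        ('2019-01-03', 'Alpha Corp.', 10), ('2019-01-05', 'Alpha Corp.', 20),
        ('2019-01-03', 'Beta Corp.', 5),   ('2019-01-07', 'Beta Corp.', 7),
        ('2019-01-09', 'Beta Corp.', 3);
""")

# sort by name ascending, then by rev descending within each name
rows = conn.execute(
    "SELECT name, rev FROM companyrev ORDER BY name, rev DESC").fetchall()

for name, rev in rows:
    print(name, rev)
# Alpha Corp. 20
# Alpha Corp. 10
# Beta Corp. 7
# Beta Corp. 5
# Beta Corp. 3
```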



Testing

Testing: Introduction

Code without tests is like driving without seatbelts.


All code is subject to errors -- not just ValueErrors and TypeErrors encountered during development, but errors related to unexpected data anomalies or user input, or the unforeseen effects of functions run in untested combinations. Unit testing is the front line of the effort to ensure code quality. Many developers say they won't take a software package seriously unless it comes with tests.


Testing: a brief rundown

Unit testing is the most basic form and the one we will focus on here. Other styles of testing:


Unit Testing

"Unit" refers to a function. Unit testing calls individual functions and validates the output or result of each.


The most easily tested scripts are made up of small functions that can be called and validated in isolation. Therefore "pure functions" (functions that do not refer to or change "external state" -- i.e., global variables) are best for testing.


Testing for success, testing for failure

A unit test script performs tests by importing the script to be tested and calling its functions with varying arguments, including ones intended to cause an error. Basically, we are hammering the code as many ways as we can to make sure it succeeds properly and fails properly.


Test-driven development

As we develop our code, we can write tests simultaneously and run them periodically as we develop. This way we can know that further changes and additions are not interfering with anything we have done previously. At any time in the process we can run the testing program and it will run all tests. In fact, commonly accepted wisdom supports writing tests before writing code! The test is written with the function in mind: after seeing that the tests fail, we write a function to satisfy the tests. This is called test-driven development.


The assert statement

assert raises an AssertionError exception if the test returns False


assert 5 == 5        # no output

assert 5 == 10       # AssertionError raised

We can incorporate this facility in a simple testing program.

Program to be tested: "myprogram.py"


import sys

def doubleit(x):
    var = x * 2
    return var

if __name__ == '__main__':
    input_val = sys.argv[1]
    doubled_val = doubleit(input_val)

    print("the value of {0} is {1}".format(input_val, doubled_val))

testing program: "test_myprogram.py"


import myprogram

def test_doubleit_value():
    assert myprogram.doubleit(10) == 20

If doubleit() didn't correctly return 20 with an argument of 10, the assert would raise an AssertionError. So even with this basic approach (without a testing module like pytest or unittest), we can do testing with assert.


pytest Basics

Any file named test_something.py containing functions named test_something() will be discovered by pytest and run automatically when we run the py.test command.


Instructions for running tests

1. Download the testing program test_[name].py (where name is any name) and place it in the same directory as your script.
2. Open up a command prompt. In Mac/Unix, open the Terminal program (you can also use the Terminal window in PyCharm). In Windows, launch the Anaconda Command Prompt, which should be accessible by searching for cmd on Windows 10 -- let me know if you have trouble finding the Anaconda prompt.
3. Make sure your homework script and the test script are in the same directory, and that your Command Prompt or Terminal window is also in that directory (let me know if you have any trouble using cd to travel to this directory in your Command Prompt or Terminal window).
4. Execute the command py.test at the command line (keep in mind this is not the Python prompt, but your Mac/Unix Terminal program or your Anaconda cmd/Command Prompt window). py.test is a special command that should work from your command line if you have Anaconda installed. (py.test is not a separate file.)
5. If your program and functions are named as directed, the testing program will import your script and test each function to see that it is providing correct output. If there are test failures, look closely at the failure output -- look for the assert test showing what values were involved in the failure. You can also look at the testing program to see what it is requiring (look for the assert statements).
6. If you see collected 0 items, it means there was no test_[something].py file (where [something] is a name of your choice) or there was no test_[something]() function inside the test program. These names are required by py.test.
7. If your run of py.test hangs (i.e., prints something out but then just waits), or if you see a lot of colorful error output saying not found here and there, it may be for the above reason.


running py.test from the command line or Anaconda command prompt:

$ py.test
 =================================== test session starts ====================================
platform darwin -- Python 2.7.10 -- py-1.4.27 -- pytest-2.7.1
rootdir: /Users/dblaikie/testpytest, inifile:
collected 1 items

test_myprogram.py .

 ================================= 1 passed in 0.01 seconds =================================

noticing failures


def doubleit(x):
    var = x * 2
    return x      # oops, returned the original value rather than the doubled value

Having incorporated an error, run py.test again:


$ py.test
 =================================== test session starts ====================================
platform darwin -- Python 2.7.10 -- py-1.4.27 -- pytest-2.7.1
rootdir: /Users/dblaikie/testpytest, inifile:
collected 1 items

test_myprogram.py F

 ========================================= FAILURES =========================================
___________________________________ test_doubleit_value ____________________________________

    def test_doubleit_value():
>       assert myprogram.doubleit(10) == 20
E       assert 10 == 20
E       +  where 10 = <function doubleit at 0x...>(10)
E       +    where <function doubleit at 0x...> = myprogram.doubleit

test_myprogram.py:7: AssertionError
 ================================= 1 failed in 0.01 seconds =================================



Higher-Order Functions and Decorators

Object References

Everything is an Object; Objects Assigned by Reference


object: a data value of a particular type
variable: a name bound to an object


When we create a new object and assign it to a name, we call it a variable. This simply means that the object can now be referred to by that name.

a = 5                 # bind an int object to name a

b = 'hello'           # bind a str object to name b

Here are three classic examples demonstrating that objects are bound by reference and that assignment from one variable to another is simply binding a 2nd name to the same object (i.e. it simply points to the same object -- no copy is made).


Aliasing (not copying) one object to another name:

a = ['a', 'b', 'c']   # bind a list object to a (by reference)

b = a                 # 'alias' the object to b as well -- 2 references

b.append('d')

print(a)               # ['a', 'b', 'c', 'd']

a was not manipulated directly, but it changed. This underscores that a and b are pointing to the same object.
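By contrast, if we want an independent copy rather than a second reference to the same object, we can build a new list. This sketch shows the difference:

```python
a = ['a', 'b', 'c']
b = list(a)          # a NEW list object with the same items (a "shallow copy")
                     # b = a[:] would do the same

b.append('d')        # modifies only the copy

print(a)             # ['a', 'b', 'c']        -- unchanged: separate objects
print(b)             # ['a', 'b', 'c', 'd']
```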


Passing an object by reference to a function:

def modifylist(this_list):   # The object bound to a
    this_list.append('d')    # Modify the object bound to a

a = ['a', 'b', 'c']

modifylist(a)                # Pass object bound to a by reference

print(a)                       # ['a', 'b', 'c', 'd']

The same dynamics at work: a is pointing at the same list object as this_list, so a change through one name is a change to the one object -- and the other name pointing to the same object will see the same changed object.


Alias an object as a container item:

a = ['a', 'b', 'c']          # list bound to a

xx = [1, 2, a]               # 3rd item is reference to list bound to a

xx[2].append('d')            # appending to list object referred to in list item

print(a)                       # ['a', 'b', 'c', 'd']

The same dynamic applied to a container item: the only difference here is that a name and a container item are pointing to the same object.


Function References: Renaming Functions

Functions are variables (objects bound to names) like any other; thus they can be renamed.

def doubleit(val):
    val2 = val * 2
    return val2

print(doubleit)         # <function doubleit at 0x105554d90>

newfunc = doubleit

xx = newfunc(5)        # 10

The output <function doubleit at 0x105554d90> is Python's way of visualizing a function (i.e. showing its printed value). The hex code refers to the function object's location in memory (this can be used for debugging to identify the individual function).


Function References: Functions in Containers

Functions are "first-class" objects, and as such can be stored in containers, or passed to other functions.


Functions can be stored in containers the same way any other object (int, float, list, dict) can be:

def doubleit(val):
    val2 = val * 2
    return val2

def tripleit(val):
    return val * 3

funclist = [ doubleit, tripleit ]

print(funclist[0](10))   # 20
print(funclist[1](10))   # 30
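Because functions can be stored in a dict as well, a dict of functions makes a handy "dispatch table" -- choose a function by key at runtime (the dict name and keys here are made up for illustration):

```python
def doubleit(val):
    return val * 2

def tripleit(val):
    return val * 3

# a "dispatch table": command strings mapped to function objects
dispatch = {'double': doubleit, 'triple': tripleit}

choice = 'triple'                 # e.g., a command read from user input
print(dispatch[choice](10))       # 30
```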

Higher-Order Built-in Functions: map(), filter() and sorted()

These functions allow us to pass a function as an argument to enable its behavior.


We can pass a function name (or a lambda, which is also a reference to a function) to any of these built-in functions. The function controls the built-in function's behavior.


One example is the function passed to sorted():

def by_last(name):
    first, last = name.split()
    return last

names = ['Joe Wilson', 'Zeb Applebee', 'Abe Zimmer']

snames = sorted(names, key=by_last)   # ['Zeb Applebee', 'Joe Wilson', 'Abe Zimmer']

In this example, we are passing the function by_last to sorted(). sorted() calls the function once with each item in the list as argument, and sorts that item by the return value from the function.
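The same key function can be written inline as a lambda (an unnamed, one-expression function), which is common for short keys like this:

```python
names = ['Joe Wilson', 'Zeb Applebee', 'Abe Zimmer']

# the lambda does exactly what by_last() does: return the last name
snames = sorted(names, key=lambda name: name.split()[1])

print(snames)    # ['Zeb Applebee', 'Joe Wilson', 'Abe Zimmer']
```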


In the case of map() (apply a function to each item and return the result) and filter() (apply a function to each item and include it in the returned list if the function returns True), the function is required:

def make_intval(val):
    return int(val)

def over99(val):
    return int(val) > 99

x = ['1', '100', '11', '10', '110']

# apply make_intval() to each item and sort by the return value
sx = sorted(x, key=make_intval)     # ['1', '10', '11', '100', '110']

# apply make_intval() to each item and store the return value in the resulting list
mx = map(make_intval, x)
print(list(mx))                      # [1, 100, 11, 10, 110]

# apply over99() to each item; if the return value is True, keep the item
fx = filter(over99, x)
print(list(fx))                      # ['100', '110']
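For comparison, list comprehensions can express the same map() and filter() transformations directly, which many consider more idiomatic:

```python
x = ['1', '100', '11', '10', '110']

mx = [int(val) for val in x]                # like map(): transform each item
fx = [val for val in x if int(val) > 99]    # like filter(): keep items that pass

print(mx)    # [1, 100, 11, 10, 110]
print(fx)    # ['100', '110']
```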

Higher-Order Functions: Functions as Arguments and as Return Values

A "higher order" function is one that can be passed to another function, or returned from another function.


The charge() function takes a function as an argument, and uses it to calculate its return value:

def charge(func, val):
    newval = func(val) + val
    return '${}'.format(newval)

def tax_ny(val):
    val2 = val * 0.085
    return val2

def tax_ca(val):
    val2 = val * 0.065
    return val2

nyval = charge(tax_ny, 10)          # pass tax_ny to charge():  $10.85
caval = charge(tax_ca, 10)          # pass tax_ca to charge():  $10.65

Any function that takes a function as an argument or returns one as a return value is called a "higher order" function.


Using a function to build another function

A function can serve as a kind of "factory" of functions.


This function returns a function as return value, after "seeding" it with a value:

def makemul(mul):
    def times(startval):
        return mul * startval
    return times

doubler = makemul(2)
tripler = makemul(3)

print(doubler(5))      # 10
print(tripler(5))      # 15

In the two examples above, the values 2 and 3 are made part of the returned functions -- "seeded" as built-in values.
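A related standard-library tool: functools.partial "seeds" an existing function's arguments much as makemul() does, as this sketch shows:

```python
from functools import partial
from operator import mul    # mul(a, b) returns a * b

# partial() pre-fills mul()'s first argument, returning a new callable
doubler = partial(mul, 2)
tripler = partial(mul, 3)

print(doubler(5))    # 10
print(tripler(5))    # 15
```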


Decorators

A decorator accepts a function as an argument and returns a replacement function as a return value.


Python decorators return a function to replace the function being decorated -- when the original function is called in the program, Python calls the replacement. A decorator can be added to any function through the use of the @ sign and the decorator name on the line above the function.


Here's a simple example of a function that returns another function (from RealPython blog):

def my_decorator(some_function):

    def wrapper():
        print("Something is happening before some_function() is called.")

        some_function()

        print("Something is happening after some_function() is called.")
    return wrapper


def this_function():
    print("Wheee!")


# now the same name points to a replacement function
this_function = my_decorator(this_function)

# calling the replacement function
this_function()

This is not a decorator yet, but it shows the concept and mechanics.


If we wish to use this as a Python decorator, we can simply use the @ decorator notation:

def my_decorator(some_function):

    def wrapper():
        print("Something is happening before some_function() is called.")

        some_function()

        print("Something is happening after some_function() is called.")
    return wrapper

@my_decorator
def this_function():
    print('Wheee!')

this_function()
                       # Something is happening before...
                       # Wheee!
                       # Something is happening after...

The benefit here is that, rather than requiring the user to explicitly pass a function to a processing function, we can simply decorate each function to be processed and it will behave as advertised.


*args and **kwargs to capture all arguments

To allow a decorated function to accept arguments, we must accept them and pass them to the decorated function.


Here's a decorator function that adds integer 1 to the return value of a decorated function:

def addone(oldfunc):
    def newfunc(*args, **kwargs):
        return oldfunc(*args, **kwargs) + 1
    return newfunc

@addone
def sumtwo(val1, val2):
    return val1 + val2

y = sumtwo(5, 10)
print(y)                  # 16

Now look closely at def newfunc(*args, **kwargs):

*args in a function definition means that all positional arguments passed to it will be collected into a tuple called args. **kwargs in a function definition means that all keyword arguments passed to it will be collected into a dictionary called kwargs. (The * and ** are not part of the variable names; they are notations that allow the arguments to be packed into the tuple and dict.)

Then on the next line, look at return oldfunc(*args, **kwargs) + 1

*args in a function call means that the tuple args will be passed as positional arguments (i.e., the reverse of what happened above). **kwargs in a function call means that the dict kwargs will be passed as keyword arguments (again, the reverse of what happened above).

This means that if we wanted to decorate a function that takes different arguments, *args and **kwargs would faithfully pass on those arguments as well.
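The packing and unpacking described above work outside decorators too; a small sketch (the function names here are made up for illustration):

```python
def show(*args, **kwargs):
    # args arrives as a tuple, kwargs as a dict
    return args, kwargs

a, k = show(1, 2, x=3)
print(a)            # (1, 2)
print(k)            # {'x': 3}

def add3(x, y, z):
    return x + y + z

vals = (1, 2, 3)
print(add3(*vals))  # 6 -- the tuple is unpacked into three positional args
```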


Here's another example, adapted from the Jeff Knupp blog:

def currency(f):                              # decorator function
    def wrapper(*args, **kwargs):
        return '$' + str(f(*args, **kwargs))
    return wrapper

@currency
def price_with_tax(price, tax_rate_percentage):
    """Return the price with *tax_rate_percentage* applied.
    *tax_rate_percentage* is the tax rate expressed as a float, like
    "7.0" for a 7% tax rate."""

    return price * (1 + (tax_rate_percentage * .01))

print(price_with_tax(50, .10))           # $50.05

In this example, *args and **kwargs represent "unlimited positional arguments" and "unlimited keyword arguments". This is done to allow the flexibility to decorate any function (as it will match any function argument signature).




Raising and Trapping Exceptions

Introduction: Exceptions

An exception is an error condition that can "bubble up" through function calls.


When a program encounters an error it is referred to as an exception. Errors are endemic in any programming language, but we can broadly classify errors into two categories:

  • unanticipated errors (caused by programmer error or oversight)
  • anticipatable errors (caused by incorrect user input, environmental errors such as permissions or missing files, networking or process errors such as database failures, etc.)

Trapping exceptions means deciding what to do when an anticipatable error occurs. When we trap an error using a try/except block, we have the opportunity to have our program respond to the error by executing a block of code. In this way, exception handling is another flow control mechanism: if this error occurs, do something about it.


    Objectives for the Unit: Exceptions

  • identify exception types (IndexError, KeyError, IOError, WindowsError, etc.)
  • use try: blocks to identify code where an exception is anticipatable
  • use except: blocks to specify the anticipatable exception and to provide code to be run in the event of an exception
  • trap multiple exception types anticipatable from a try: block, and chain except: blocks to execute different code blocks depending on which exception type was raised.


    Summary: Exceptions signify an error condition

    Exceptions are raised when an error condition is encountered; this condition can be handled.


    In each of these anticipatable errors below, the user can easily enter a value that is invalid and would cause an error if not handled. We are not handling these errors, but we should:


    ValueError: when the wrong value is used with a function or statement

    uin = input('enter an int:  ')  # user enters 'hello'
    
    intval = int(uin)
    
      # Traceback (most recent call last):
      #   File "<stdin>", line 1, in <module>
      # ValueError: invalid literal for int() with base 10:  'hello'

    KeyError: when a dictionary key cannot be found. Here we ask the user for a key in the dict, but they could easily enter the wrong key:

    mydict = {'1972': 3.08, '1973': 1.01, '1974': -1.09}
    
    uin = input('please enter a year: ')    # user enters 2116
    
    print('mktrf for {} is {}'.format(uin, mydict[uin]))
    
    
      # Traceback (most recent call last):
      #   File "./test2.py", line 7, in <module>
      #     print('mktrf for {} is {}'.format(uin, mydict[uin]))
      # KeyError: '2116'

    IndexError: when a list index can't be found. Here we ask the user to enter an argument at the command line, but they could easily skip entering the argument:

    import sys
    
    user_input = sys.argv[1]               # user enters no arg at command line
    
    
    Traceback (most recent call last):
       File "getarg.py", line 3, in <module>
    IndexError: list index out of range

    OSError: when a file or directory can't be found. Here we ask the user to enter a filename

    import os
    
    user_file = input('please enter a filename:  ')    # user enters a file that doesn't exist
    
    file_size = os.path.getsize(user_file)
    
    
    Traceback (most recent call last):
        File "getsize.py", line 5, in <module>
    FileNotFoundError: [Errno 2] No such file or directory: 'mispeld.txt'

    In each of these situations we are working with an error that occurs not due to an incorrect statement, but because of user input error. We can then say that these errors are anticipatable and thus may be handled by our script.


    Summary statements: try block and except block

    The try: block contains statements from which a potential error condition is anticipated; the except: block identifies the anticipated exception and contains statements to be executed if the exception occurs.


    How to avoid an anticipatable exception?

  • wrap the lines where the error is anticipated in a try: block
  • define statements to be executed if the error occurs


    try:
        firstarg = sys.argv[1]
        secondarg = sys.argv[2]
    except IndexError:
        exit('error:  two args required')

    This code anticipates that the user may not pass arguments to the script. If two arguments are not passed, then sys.argv[1] or sys.argv[2] will fail with an IndexError exception.


    Summary technique: trapping multiple exceptions

    Multiple exceptions can be trapped using a tuple of exception types.


    try:
        firstarg = sys.argv[1]
        secondarg = sys.argv[2]
    
        firstint = int(firstarg)
        secondint = int(secondarg)
    
    except (IndexError, ValueError):
        exit('error:  two int args required')

    In this case, whether an IndexError or a ValueError exception is raised, the except: block will be executed.


    Summary technique: chaining except: blocks

    The same try: block can be followed by multiple except: blocks.


    try:
        firstint = int(sys.argv[1])
        secondint = int(sys.argv[2])
    except IndexError:
        exit('error:  two args required')
    except ValueError:
        exit('error:  args must be ints')

    The exception raised will be matched against each type, and the first one that matches will execute its block.


    try: blocks include all nested subroutine calls

    A try block around a function call will apply to statements within the block, statements within any function call within the block, and continue on through all successive function calls within. In this example a try catches an error within a twice-nested function call:


    def f():
      print("in f, before 1/0")
      1/0
      print("in f, after 1/0")
    
    def g():
      print("in g, before f()")
      f()
      print("in g, after f()")
    
    def h():
      print("in h, before g()")
      try:
        g()
        print("in h, after g()")
      except ZeroDivisionError:
        print("ZD exception caught")
      print("function h ends")
    
    h()
    print("program continues...")

    The above produces the following output:


    in h, before g()
    in g, before f()
    in f, before 1/0
    ZD exception caught
    function h ends
    program continues...             # note that program continues executing

    If we remove the try/except blocks from the code, Python prints the entire traceback and the program terminates. Note how Python lists out each of the calls that led to the ZeroDivisionError exception: this can help in debugging:


    in h, before g()
    in g, before f()
    in f, before 1/0
    Traceback (most recent call last):
      File "./test22.py", line 20, in <module>
        h()
      File "./test22.py", line 15, in h
        g()
      File "./test22.py", line 10, in g
        f()
      File "./test22.py", line 5, in f
        1/0
    ZeroDivisionError: division by zero

    raise

    We can also raise exceptions ourselves, using the raise statement. Here is the form:


    try:
      raise ValueError('bad value')
    except ValueError:
      print("exception raised")

    Here is a function that requires two non-empty sequences, and raises an exception if it doesn't find them:


    def crossProduct(seq1, seq2):
      if not seq1 or not seq2:
        raise ValueError("Sequence arguments must be non-empty")
      return [(x1, x2) for x1 in seq1 for x2 in seq2]
    
    crossProduct([1, 2, 3], [ ])

    Note that there is no need for crossProduct to test whether seq1 and seq2 are iterable -- the list comprehension will do that. If they aren't (or, as here, one of them is empty), the exception will propagate up to the caller of crossProduct, who can handle it if desired.
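The caller of crossProduct() can trap that ValueError like any other exception; a sketch:

```python
def crossProduct(seq1, seq2):
    if not seq1 or not seq2:
        raise ValueError("Sequence arguments must be non-empty")
    return [(x1, x2) for x1 in seq1 for x2 in seq2]

try:
    pairs = crossProduct([1, 2, 3], [])
except ValueError as e:
    pairs = []                               # recover with a default value
    print('error: {}'.format(e.args[0]))     # error: Sequence arguments must be non-empty
```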


    raise with arguments

    We can pass arguments to raise (and capture them in except) allowing for more detailed recovery and/or error messages:


    import sys
    
    def get_arg():
        return int(sys.argv[1])
    
    def main():
        try:
            index = get_arg()
        except (IndexError, ValueError):
            raise ValueError('please enter an int argument')
    
    if __name__ == '__main__':
        try:
            main()
        except Exception as e:
            msg = e.args[0]
            exit('error:  {}'.format(msg))
        exit(0)

    Here in main() we trapped two possible exceptions that might have been raised in get_arg(), then raised a ValueError with our own (descriptive) message so it could be trapped and handled in the main body, exiting with the message (of course we could call usage() here as well).


    raise by itself

    raise by itself re-raises a currently occurring exception. raise can be called only while exception handling is in progress, i.e. from within an except block or from within a function called from an except block. It is used to re-raise the current exception when the handler decides it would prefer not to handle the exception.


    In the below example, note the use of raise near the bottom.

    """ get_done.py -- demo program """
    
    import sys
    
    DEBUG = False                  # constant can be set manually as desired
    
    def do_something():
        try:
            arg = sys.argv[1]      # might raise an IndexError
            int_arg = int(arg)     # might raise a ValueError
        except (IndexError, ValueError):
            raise ValueError('please enter a positive integer')
        if not int_arg > 0:
            raise ValueError('please enter a positive integer')
        return int_arg
    
    def main():
        val = do_something()
        print('value is {}'.format(val))
    
    if __name__ == '__main__':
        try:
            main()
        except Exception as obj:       # traps most exceptions
            if DEBUG:                  # if DEBUG == True,
                raise                  # re-raise same exception
            else:
                sys.exit(obj.args[0])  # exit with error msg from exception
    
        exit(0)                        # exit without error

    First, except Exception traps almost any exception raised within its try: block (this is because Exception is a parent class of most other exceptions). Remember that since exceptions bubble up from function calls, this means that the try: block wrapping the call to main() will trap any exception that occurred in the functions called by the script.

    The obj used in the except Exception as obj statement is the exception object passed by a trapped exception. We can use obj.args[0] to read any arguments passed by the exception. In our case, we have possibly raised two exceptions in do_something() (and Python may have raised an unexpected exception as well). Since our raise statement includes an error message, we can read this message with args[0].

    The idea here is that the program can be run in one of two modes: with DEBUG=False, the program calls exit() with the error message passed by the raise. With DEBUG=True, the program simply re-raises the exception -- this provides us with the stack trace that can help us diagnose the problem.

    In many programs (particularly, web server-side programs), we don't want the user to see an exception stack trace -- this exposes lines of code from our program which might be sensitive. The method used here allows us to control how the program handles the error, and thus what the user sees. It should be understood that this is not the only way to use exceptions for flow control, but it can be useful in some situations. It is intended more to indicate what is possible using the exception handling mechanism.




    Testing

    Writing tests

    Use assert to test values returned from a function against expected values




    Writing tests with pytest

    Any file named test_something.py containing functions named test_something() will be discovered by pytest and run automatically when we run the py.test command.


    Instructions for writing and running tests using pytest
    
    1. Make sure your program, myprogram.py, and your testing program test_myprogram.py are in the same directory.
    2. Open up a command prompt. In Mac/Unix, open the Terminal program (you can also use the Terminal window in PyCharm). In Windows, launch the Anaconda Command Prompt, which should be accessible by searching for cmd on Windows 10 -- let me know if you have trouble finding the Anaconda prompt.
    3. Use cd to change the present working directory for your Command Prompt or Terminal window session to that same directory (let me know if you have any trouble with this step).
    4. Execute the command py.test at the command line (keep in mind this is not the Python prompt, but your Mac/Unix Terminal program or your Anaconda cmd/Command Prompt window). py.test is a special command that should work from your command line if you have Anaconda installed. (py.test is not a separate file.)
    5. If your program and functions are named as directed, the testing program will import your script and test each function to see that it is providing correct output. If there are test failures, look closely at the failure output -- look for the assert test showing what values were involved in the failure. You can also look at the testing program to see what it is requiring (look for the assert statements).
    6. If you see collected 0 items, it means there was no test_[something].py file (where [something] is a name of your choice) or there was no test_[something]() function inside the test program. These names are required by py.test.
    7. If your run of py.test hangs (i.e., prints something out but then just waits), or if you see a lot of colorful error output saying not found here and there, it may be for the above reason.


    running py.test from the command line or Anaconda command prompt:

    $ py.test
     =================================== test session starts ====================================
    platform darwin -- Python 2.7.10 -- py-1.4.27 -- pytest-2.7.1
    rootdir: /Users/dblaikie/testpytest, inifile:
    collected 1 items
    
    test_myprogram.py .
    
     ================================= 1 passed in 0.01 seconds =================================

    noticing failures


    def doubleit(x):
        var = x * 2
        return x      # oops, returned the original value rather than the doubled value

    Having incorporated an error, run py.test again:


    $ py.test
     =================================== test session starts ====================================
    platform darwin -- Python 2.7.10 -- py-1.4.27 -- pytest-2.7.1
    rootdir: /Users/dblaikie/testpytest, inifile:
    collected 1 items
    
    test_myprogram.py F
    
     ========================================= FAILURES =========================================
    ___________________________________ test_doubleit_value ____________________________________
    
        def test_doubleit_value():
    >       assert myprogram.doubleit(10) == 20
    E       assert 10 == 20
    E        +  where 10 = <function doubleit at 0x...>(10)
    E        +    where <function doubleit at 0x...> = myprogram.doubleit
    
    test_myprogram.py:7: AssertionError
     ================================= 1 failed in 0.01 seconds =================================

    Testing the expected raising of an exception

    Many of our tests will deliberately pass bad input and test to see that an appropriate exception is raised.


    import sys
    
    def doubleit(x):
        if not isinstance(x, (int, float)):           # make sure the arg is the right type
            raise TypeError('must be int or float')   # if not, raise a TypeError
        var = x * 2
        return var
    
    if __name__ == '__main__':
        input_val = float(sys.argv[1])    # convert: command-line arguments are always strings
        doubled_val = doubleit(input_val)
    
        print("the value of {0} is {1}".format(input_val, doubled_val))

    Note that without the type check, the function could still run but produce a wrong result (for example, if a string or list were passed instead of an integer, x * 2 would repeat it rather than double it). To verify that this error condition is correctly raised, we can use with pytest.raises(TypeError).


    import myprogram
    import pytest
    
    def test_doubleit_value():
        assert myprogram.doubleit(10) == 20
    
    def test_doubleit_type():
        with pytest.raises(TypeError):
            myprogram.doubleit('hello')

    with here is the same context manager construct we have used with open(): it can also be used to detect that an exception occurred inside the with block.
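    pytest.raises() can also capture the exception object for further inspection -- for example, to check the error message. A short sketch using the standard "as excinfo" form of pytest.raises (the doubleit() here repeats the function from above so the example is self-contained):

    ```python
    import pytest

    def doubleit(x):
        if not isinstance(x, (int, float)):
            raise TypeError('must be int or float')
        return x * 2

    def test_doubleit_type_message():
        # 'as excinfo' captures an ExceptionInfo object describing the raise
        with pytest.raises(TypeError) as excinfo:
            doubleit('hello')
        # excinfo.value is the TypeError instance itself
        assert 'int or float' in str(excinfo.value)
    ```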


    Grouping tests into a class

    We can organize related tests into a class, which can also include setup and teardown routines that are run automatically (discussed next).


    """ test_myprogram.py -- test functions in a testing class """
    
    import myprogram
    import pytest
    
    class TestDoubleit(object):
    
        def test_doubleit_value(self):
            assert myprogram.doubleit(10) == 20
    
        def test_doubleit_type(self):
            with pytest.raises(TypeError):
                myprogram.doubleit('hello')

    The same naming rule applies to how py.test looks for tests in classes -- if the class name begins with the word Test, pytest will treat it as a testing class.


    Mock data: setup and teardown

    Tests should not be run on "live" data; instead, the data should be simulated, or "mocked", to provide exactly what the test needs.


    """ myprogram.py -- makework functions for the purposes of demonstrating testing """
    
    import sys
    
    def doubleit(x):
        """ double a number argument, return doubled value """
        if not isinstance(x, (int, float)):
            raise TypeError('arg to doubleit() must be int or float')
        var = x * 2
        return var
    
    def doublelines(filename):
        """open a file of numbers, double each line, write each line to a new file"""
        with open(filename) as fh:
            newlist = []
            for line in fh:                   # file is assumed to have one number on each line
                floatval = float(line)
                doubleval = doubleit(floatval)
                newlist.append(str(doubleval))
        with open(filename, 'w') as fh:
            fh.write('\n'.join(newlist))
    
    if __name__ == '__main__':
        input_val = float(sys.argv[1])    # convert: command-line arguments are always strings
        doubled_val = doubleit(input_val)
    
        print("the value of {0} is {1}".format(input_val, doubled_val))

    For this demo I've invented a rather arbitrary example to combine an external file with the doubleit() routine: doublelines() opens and reads a file, doubles the value on each line, and writes the doubled values, one per line, back to the same file (whose name is supplied to doublelines()).


    """ test_myprogram.py -- test the doubleit.py script """
    
    import myprogram
    import os
    import pytest
    import shutil
    
    class TestDoubleit(object):
    
        numbers_file_template = 'testnums_template.txt'  # template for test file (stays the same)
        numbers_file_testor = 'testnums.txt'             # filename used for testing
                                                         # (changed during testing)
    
        def setup_class(self):
            shutil.copy(TestDoubleit.numbers_file_template, TestDoubleit.numbers_file_testor)
    
        def teardown_class(self):
            os.remove(TestDoubleit.numbers_file_testor)
    
        def test_doublelines(self):
            myprogram.doublelines(TestDoubleit.numbers_file_testor)
            old_vals = [ float(line) for line in open(TestDoubleit.numbers_file_template) ]
            new_vals = [ float(line) for line in open(TestDoubleit.numbers_file_testor) ]
            for old_val, new_val in zip(old_vals, new_vals):
                assert float(new_val) == float(old_val) * 2
    
        def test_doubleit_value(self):
            assert myprogram.doubleit(10) == 20
    
        def test_doubleit_type(self):
            with pytest.raises(TypeError):
                myprogram.doubleit('hello')

    setup_class and teardown_class run automatically. As you can see, they prepare a dummy file and, when the testing is over, delete it. In between, the tests are run in the order they are defined in the class.
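    Besides setup_class/teardown_class (run once for the whole class), pytest also supports setup_method/teardown_method, which run before and after every individual test. A sketch that records the call order (the method names are pytest's conventions; the calls list is bookkeeping added just for illustration):

    ```python
    class TestOrder(object):

        calls = []                             # shared log, for illustration only

        def setup_method(self, method):        # runs before *each* test method
            TestOrder.calls.append('setup:' + method.__name__)

        def teardown_method(self, method):     # runs after *each* test method
            TestOrder.calls.append('teardown:' + method.__name__)

        def test_one(self):
            assert 1 + 1 == 2

        def test_two(self):
            assert 2 * 2 == 4
    ```

    Run under py.test, calls would end up as ['setup:test_one', 'teardown:test_one', 'setup:test_two', 'teardown:test_two'] -- a fresh setup/teardown pair around every test.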




    Packages

    Introduction: Packages

    A package is a directory of files that work together as a Python application or library module.


    Many applications or library modules consist of more than one file. A script may require configuration files or data files; some applications combine several .py files that work together. In addition, programs need unit tests to ensure reliability. A package groups all of these files (scripts, supporting files and tests) together as one entity. In this unit we'll discover Python's structure and procedures for creating packages. Some of the steps here were taken from this very good tutorial on packages: https://python-packaging.readthedocs.io/en/latest/minimal.html


    Package Structure and the __init__.py File

    The base of a package is a directory with an __init__.py file.


    Folder structure for package davehello:

    davehello/                      # base package folder - name is discretionary
        davehello/                  # module folder -  usually same name
            __init__.py             # initial script -- this is run first
        setup.py                    # setup file -- discussed below

    The initial code for our program: __init__.py

    def greet():
        return 'hello, world!'

    The names of the folders are up to you. The "outer" davehello/ is the name of the base package folder. The "inner" davehello/ is the name of your module. These can be the same or different. setup.py is discussed next.


    setup.py: the installation script

    This file describes the script and its authorship.


    Inside setup.py put the following code, but replace the name, author, author_email and packages values (the packages list should reflect the name given in name):

    from setuptools import setup
    
    setup( name='davehello',
           version='0.1',
           description='This module greets the user.  ',
           url='',                # usually a github URL
           author='David Blaikie',
           author_email='david@davidbpython.com',
           license='MIT',
           packages=['davehello'],
           install_requires=[ ],
           zip_safe=False )

    setuptools is a Python module for building and distributing packages. The setup() function establishes meta information for the package.


    url can be left blank for now. Later on we will commit this package to github and put the github URL here. packages should be a list of the packages that are part of this package (as there can be sub-packages within a package); however, we will just work with the one package.


    Again, folder structure for package davehello with two files:

    davehello/                      # base package folder - name is discretionary
        davehello/                  # module folder -  usually same name
            __init__.py             # initial script -- this is run first
        setup.py                    # setup file -- discussed below

    Please double-check your folder structure and the placement of files -- this is vital to being able to run the files.


    Installing Locally

    pip install can install your module into your own local Python module directories.


    First, make sure you're in the same directory as setup.py. Then from the Unix/Mac Terminal or Windows Command Prompt:

    $ pip install .                            # $ means Unix/Mac prompt
    Processing /Users/david/Desktop/davehello
    Installing collected packages: davehello
      Running setup.py install for davehello ... done
    Successfully installed davehello-0.1

    The install process copies your package files to an install directory that is part of your Python installation's sys.path. Remember that sys.path holds the list of directories that will be searched when you import a module. If you get an error when you try to install, double-check your folder structure and placement of files, and make sure you're in the same directory as setup.py.
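    You can see this search path for yourself from any Python session. As a quick sketch ('json' is used below simply because it is a stdlib module that is always present):

    ```python
    import sys
    import importlib.util

    # sys.path is the list of directories Python searches, in order, on import;
    # after a successful 'pip install .' your package lives in one of these
    for directory in sys.path:
        print(directory)

    # find_spec() reports where a module would be loaded from, without importing it
    spec = importlib.util.find_spec('json')
    print(spec.origin)                   # path to the file that would be loaded
    ```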


    If successful, you should now be able to open a new Terminal or Command Prompt window (on Windows use the Anaconda Prompt), cd into your home directory, launch a Python interactive session, and import your module:

    $ cd /Users/david                # moving to my home directory, to make sure
                                     # we're running the installed version
    
    $ python
    Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:04:09)
    [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    
    >>> import davehello
    >>> davehello.greet()
    'hello, world!'
    >>>

    If you get a ModuleNotFoundError when you try to import: 1. It is possible that the files are not in their proper places, for example if __init__.py is in the same directory as setup.py. 2. It is possible that the pip you used installed the module into a different Python distribution than the one you are running.


    $ pip --version
    pip 9.0.1 from /Users/david/anaconda/lib/python3.6/site-packages (python 3.6)
    
    Davids-MacBook-Pro-2:~ david$ python -V
    Python 3.6.1 :: Anaconda custom (64-bit)
    Davids-MacBook-Pro-2:~ david$

    Note that my pip --version path indicates that it's running under Anaconda, and my python -V also indicates Anaconda.


    Package Development Directory vs. Installation Directory

    Development Directory is where you created the files; Installation Directory is where Python copied them upon install.


    Keep in mind that when you import a module, the current directory will be searched before any directories on the sys.path. So if your command line / Command Prompt / Terminal session is currently in the same directory as setup.py (as we were before we cd'd to the home directory), you'll be importing from your local package, not the installed one. So you won't be testing the installation until you move away from the package directory.


    To see which folder the module was installed into, make sure you're not in the package directory; then read the module's __file__ attribute:

    $ cd /Users/david                # moving to my home directory, to make sure
                                     # we're running the installed version
    
    $ python
    Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:04:09)
    [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    
    >>> import davehello
    >>> davehello.__file__
    
    '/Users/david/anaconda/lib/python3.6/site-packages/davehello/__init__.py'

    Note that this is not one of my directories or one that I write to; it is a common install directory for Anaconda Python.


    Contrast this with the result if you're importing the module from the package directory (the same directory as setup.py):

    $ cd /Users/david/Desktop/davehello/          # my package location
    
    $ python
    
    >>> import davehello
    >>> davehello.__file__
    
    '/Users/david/Desktop/davehello/davehello/__init__.py'

    Note that this is one of my local directories.


    Making Changes and Reinstalling

    Changes to the package will not be reflected in the installed module unless we reinstall.


    If you make a change to the package source files, it won't be reflected in your Python installation until you reinstall with pip install. (The exception to this is if you happen to be importing the module from within the package itself -- then the import will read from the local files.)


    To reinstall a previously installed module, we must include the --upgrade flag:

    $ pip install . --upgrade
    Processing /Users/david/Desktop/davehello
    Installing collected packages: davehello
      Found existing installation: davehello 0.1
        Uninstalling davehello-0.1:
          Successfully uninstalled davehello-0.1
      Running setup.py install for davehello ... done
    Successfully installed davehello-0.1

    Adding files

    __init__.py is the "gateway" file; the bulk of code may be in other .py files in the package.


    Many packages are made up of several .py files that work together. They may be files that are only called internally, or they may be intended to be called by the user. Your entire module could be contained within __init__.py, but I believe this file is customarily used only as the gateway, with the bulk of module code in another .py file. In this step we'll move our function to another file.


    hello.py (new file -- this can be any name)

    def greet():
        return 'hello, new file!'

    __init__.py

    from .hello import greet       # .hello refers to hello.py in the same package directory

    New folder structure for package davehello:

    davehello/                      # base folder - name is discretionary
        davehello/                  # package folder -  usually same name
            __init__.py             # initial script -- this is run first
            hello.py                # new file
        setup.py                    # setup file -- discussed below

    Don't forget to reinstall the module once you've finalized changes. However, you can run the package locally (i.e., from the same directory as setup.py) without reinstalling. When the package is imported, Python reads and executes the __init__.py program. This file is now importing greet from hello.py into the module's namespace, making it available to the user under the package name davehello.


    The user can also reach the function in hello.py directly, by using attribute syntax to reach the submodule -- so all of these calls to greet() should work:

    >>> import davehello as dh
    >>> dh.greet()                    # 'hello, new file!'
                                      # (because __init__.py imported greet from hello.py)
    
    >>> dh.hello.greet()              # 'hello, new file!'
                                      # (calling it directly in hello.py)
    
    >>> from davehello import hello
    >>> hello.greet()                 # 'hello, new file!'
    

    The mechanics of files and variables in packages

    Packages provide for accessing variables within multiple files.
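    Since these mechanics are easiest to see live, here is a self-contained sketch that builds a throwaway package on disk (the package name demopkg is made up for this illustration) and shows both access routes -- through the name imported in __init__.py, and through the submodule directly:

    ```python
    import os
    import sys
    import tempfile

    # build a throwaway package in a temp directory to demonstrate the mechanics
    base = tempfile.mkdtemp()
    pkg_dir = os.path.join(base, 'demopkg')
    os.mkdir(pkg_dir)

    # hello.py defines a variable and a function
    with open(os.path.join(pkg_dir, 'hello.py'), 'w') as fh:
        fh.write("GREETING = 'hello, new file!'\n"
                 "def greet():\n"
                 "    return GREETING\n")

    # __init__.py is the gateway: it lifts greet into the package namespace
    with open(os.path.join(pkg_dir, '__init__.py'), 'w') as fh:
        fh.write("from .hello import greet\n")

    sys.path.insert(0, base)          # make the new package importable
    import demopkg

    print(demopkg.greet())            # hello, new file!  (via __init__.py's import)
    print(demopkg.hello.GREETING)     # hello, new file!  (reaching into hello.py directly)
    ```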



    Specifying Dependencies

    Dependencies are other modules that your module may need to import.


    If your module imports a non-standard module like splain, that module is known as a dependency. Dependencies must be mentioned in the setup() spec. The installer will make sure any dependent modules are installed so your module works correctly.


    setup(name='davehello',
          version='0.1',
          description='This module greets the user.  ',
          url='',                # usually a github URL
          author='David Blaikie',
          author_email='david@davidbpython.com',
          license='MIT',
          packages=['davehello'],
          install_requires=[ 'splain' ],
          zip_safe=False)

    This would not be necessary if the user already had splain installed. However, if they didn't, we would want the install of our module to result in the automatic installation of the splain module. (Please note that splain.py has not yet been uploaded to PyPI, so the above dependency will not work.)


    Adding Tests

    Tests belong in the package; thus anyone who downloads the source can run the tests.


    In a package, tests should be added to a tests/ directory in the package root (i.e., in the same directory as setup.py).


    We will use pytest for our testing -- the following configuration values need to be added to setup() in the setup.py file:

    test_suite='pytest'
    setup_requires=['pytest-runner']
    tests_require=['pytest']

    Here's our updated setup.py:

    from setuptools import setup
    
    setup(name='davehello',
          version='0.1',
          description='This module greets the user.  ',
          url='',                # usually a github URL
          author='David Blaikie',
          author_email='david@davidbpython.com',
          license='MIT',
          packages=['davehello'],
          install_requires=[ ],
          test_suite='pytest',
          setup_requires=['pytest-runner'],
          tests_require=['pytest'],
          zip_safe=False)

    As is true for most testing suites, pytest requires that our test filenames begin with test_ and that test function names begin with test_.


    Here is our test program test_hello.py, with test_greet(), which tests the greet() function.

    import pytest
    import davehello as dh
    
    def test_greet():
        assert dh.greet() == 'hello, world!'

    Here's a new folder structure for package davehello:

    davehello/                      # base folder - name is discretionary
        davehello/                  # package folder -  usually same name
            __init__.py             # initial script -- this is run first
            hello.py                # new file
        tests/
            test_hello.py
        setup.py                    # setup file -- discussed below

    Now when we'd like to run the package's tests, we run the following at the command line:

    $ python setup.py pytest
    running pytest
    running egg_info
    writing davehello.egg-info/PKG-INFO
    writing dependency_links to davehello.egg-info/dependency_links.txt
    writing top-level names to davehello.egg-info/top_level.txt
    reading manifest file 'davehello.egg-info/SOURCES.txt'
    writing manifest file 'davehello.egg-info/SOURCES.txt'
    running build_ext
    ------------------------------- test session starts -------------------------------
    
    platform darwin -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
    rootdir: /Users/david/Desktop/davehello, inifile:
    collected 1 items
    
    tests/test_hello.py F
    ----------------------------------- FAILURES -----------------------------------
    
        def test_greet():
    >       assert dh.greet() == 'hello, world!'
    E       AssertionError: assert 'hello, new file!' == 'hello, world!'
    E         - hello, new file!
    E         + hello, world!
    
    tests/test_hello.py:7: AssertionError
    --------------------------- 1 failed in 0.03 seconds ---------------------------

    Oops, our test failed. We're not supplying the right value to assert -- the function returns hello, new file! and our test is looking for hello, world!. We go into test_hello.py and modify the assert statement; alternatively, we could change the output of the function.


    After change has been made to test_hello.py to reflect the expected output:

    $ python setup.py pytest
    running pytest
    running egg_info
    writing davehello.egg-info/PKG-INFO
    writing dependency_links to davehello.egg-info/dependency_links.txt
    writing top-level names to davehello.egg-info/top_level.txt
    reading manifest file 'davehello.egg-info/SOURCES.txt'
    writing manifest file 'davehello.egg-info/SOURCES.txt'
    running build_ext
    ------------------------------- test session starts -------------------------------
    platform darwin -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
    rootdir: /Users/david/Desktop/davehello, inifile:
    collected 1 items
    
    tests/test_hello.py .
    
    ----------------------------- 1 passed in 0.01 seconds ----------------------------

    The output first shows us what setup.py is doing in the background, then shows collected 1 items to indicate that it's ready to run tests. The final statement indicates how many tests passed (or failed).


    With these basic steps you can create a package, install it in your Python distribution, and prepare it for distribution to the world. May all beings be happy.




    Distributing Modules and the PyPI Package Repo

    PyPI: The Python Package Index

    All publicly available modules can be found here.


    https://testpypi.python.org/pypi


    Registering your Application on PyPI

    setuptools can do this for us automatically.


    Davids-MacBook-Pro-2:dblaikie_hello david$ python setup.py register
    running register
    running egg_info
    writing dblaikie_hello.egg-info/PKG-INFO
    writing dependency_links to dblaikie_hello.egg-info/dependency_links.txt
    writing top-level names to dblaikie_hello.egg-info/top_level.txt
    reading manifest file 'dblaikie_hello.egg-info/SOURCES.txt'
    writing manifest file 'dblaikie_hello.egg-info/SOURCES.txt'
    running check
    We need to know who you are, so please choose either:
     1. use your existing login,
     2. register as a new user,
     3. have the server generate a new password for you (and email it to you), or
     4. quit
    Your selection [default 1]:

    twine for uploading to PyPI

    $ pip install twine



    Classes

    Introduction: Classes

    Classes allow us to create a custom type of object -- that is, an object with its own behaviors and its own ways of storing data. Consider that each of the objects we've worked with previously has its own behavior, and stores data in its own way: dicts store pairs, sets store unique values, lists store sequential values, etc. An object's behaviors can be seen in its methods, as well as how it responds to operations like subscript, operators, etc. An object's data is simply the data contained in the object or that the object represents: a string's characters, a list's object sequence, etc.


    Objectives for this Unit: Classes

  • Understand what classes, objects and attributes are and why they are useful
  • Create our own classes -- our own object types
  • Set attributes in objects and read attributes from objects
  • Define methods in classes that can be used by objects
  • Define object initializers with __init__()
  • Use getter and setter methods to enforce encapsulation
  • Understand class inheritance
  • Understand polymorphism


    Class Example: the date and timedelta object types

    First let's look at object types that demonstrate the convenience and range of behaviors of objects.


    A date object can be set to any date and knows how to calculate dates into the future or past. To change the date, we use a timedelta object, which can be set to an "interval" of days to be added to or subtracted from a date object.


    from datetime import date, timedelta
    
    dt = date(1926, 12, 30)         # create a new date object set to 12/30/1926
    td = timedelta(days=3)          # create a new timedelta object:  3 day interval
    
    dt = dt + td                    # add the interval to the date object:  produces a new date object
    
    print(dt)                        # '1927-01-02' (3 days after the original date)
    
    
    dt2 = date.today()              # as of this writing:  set to 2016-08-01
    dt2 = dt2 + timedelta(days=1)   # add 1 day to today's date
    
    print(dt2)                       # '2016-08-02'
    
    print(type(dt))                  # <class 'datetime.date'>
    print(type(td))                  # <class 'datetime.timedelta'>

    Class Example: the proposed server object type

    Now let's imagine a useful object -- this proposed class will allow you to interact with a server programmatically. Each server object represents a server that you can ping, restart, copy files to and from, etc.


    import time
    from sysadmin import Server
    
    
    s1 = Server('blaikieserv')
    
    if s1.ping():
        print('{} is alive '.format(s1.hostname))
    
    s1.restart()                       # restarts the server
    
    s1.copyfile_up('myfile.txt')       # copies a file to the server
    s1.copyfile_down('yourfile.txt')   # copies a file from the server
    
    print(s1.uptime())                  # blaikieserv has been alive for 2 seconds

    A class block defines an object "factory" which produces objects (instances) of the class.

    Method calls on the object refer to functions defined in the class.


    class Greeting(object):
        """ greets the user """
    
        def greet(self):
            print('hello, user!')
    
    
    c = Greeting()
    
    c.greet()                    # hello, user!
    
    print(type(c))                # <class '__main__.Greeting'>

    Each class object or instance is of a type named after the class. In this way, class and type are almost synonymous.
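    A quick way to confirm this equivalence (the Greeting class here repeats the one defined above):

    ```python
    class Greeting(object):
        """ greets the user """
        def greet(self):
            print('hello, user!')

    c = Greeting()

    print(type(c) is Greeting)       # True -- the instance's type is the class itself
    print(isinstance(c, Greeting))   # True

    # built-in types work the same way:  str is itself a class
    print(type('hello') is str)      # True
    ```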


    Each object holds an attribute dictionary

    Data is stored in each object through its attributes, which can be written and read just like dictionary keys and values.


    class Something(object):
        """ just makes 'Something' objects """
    
    obj1 = Something()
    obj2 = Something()
    
    obj1.var = 5             # set attribute 'var' to int 5
    obj1.var2 = 'hello'      # set attribute 'var2' to str 'hello'
    
    obj2.var = 1000          # set attribute 'var' to int 1000
    obj2.var2 = [1, 2, 3, 4] # set attribute 'var2' to list [1, 2, 3, 4]
    
    
    print(obj1.var)           # 5
    print(obj1.var2)          # hello
    
    print(obj2.var)           # 1000
    print(obj2.var2)          # [1, 2, 3, 4]
    
    obj2.var2.append(5)      # appending to the list stored to attribute var2
    
    print(obj2.var2)          # [1, 2, 3, 4, 5]

    In fact the attribute dictionary is a real dict, stored within a "magic" attribute of the object:

    print(obj1.__dict__)      # {'var': 5, 'var2': 'hello'}
    
    print(obj2.__dict__)      # {'var': 1000, 'var2': [1, 2, 3, 4, 5]}

    The class also holds an attribute dictionary

    Data can also be stored in a class through class attributes or through variables defined in the class.


    class MyClass():
        """ The MyClass class holds some data """
    
        var = 10              # set a variable in the class (a class variable)
    
    
    MyClass.var2 = 'hello'    # set an attribute directly in the class object
    
    print(MyClass.var)         # 10      (attribute was set as variable in class block)
    print(MyClass.var2)        # 'hello' (attribute was set as attribute in class object)
    
    print(MyClass.__dict__)    # {'var': 10,
                              #  '__module__': '__main__',
                              #  '__doc__': ' The MyClass class holds some data ',
                              #  'var2': 'hello'}

    The additional __module__ and __doc__ attributes are added automatically -- __module__ indicates the module in which the class was defined (here, the script being run); __doc__ is a special string reserved for documentation on the class.


    object.attribute lookup tries to read from object, then from class

    If an attribute can't be found in an object, it is searched for in the class.


    class MyClass(object):
      classval = 10         # class attribute
    
    a = MyClass()
    b = MyClass()
    
    b.classval = 99         # instance attribute of same name
    
    print(a.classval)        # 10 - still class attribute
    print(b.classval)        # 99 - instance attribute
    
    del b.classval          # delete instance attribute
    
    print(b.classval)        # 10 -- now back to class attribute

    Method calls pass the object as first (implicit) argument, called self

    Object methods or instance methods allow us to work with the object's data.


    class Do(object):
        def printme(self):
            print(self)      # <__main__.Do object at 0x1006de910>
    
    x = Do()
    
    print(x)                 # <__main__.Do object at 0x1006de910>
    x.printme()

    Note that x and self print the same hex address. This indicates that they are the very same object.
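    Because self is just an implicit first argument, a method call on the instance is equivalent to calling the function through the class and passing the instance explicitly. A small sketch (whoami() is a made-up method name for this illustration):

    ```python
    class Do(object):
        def whoami(self):
            return id(self)          # return the object's unique id for comparison

    x = Do()

    # x.whoami() and Do.whoami(x) both pass x as 'self'
    print(x.whoami() == Do.whoami(x))    # True
    print(x.whoami() == id(x))           # True -- 'self' inside the method *is* x
    ```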


    Instance methods / object methods and object attributes: changing object "state"

    Since instance methods pass the object, and we can store values in object attributes, we can combine these to have a method modify an object's values.


    class Sum(object):
        def add(self, val):
            if not hasattr(self, 'x'):
                self.x = 0
            self.x = self.x + val
    
    myobj = Sum()
    myobj.add(5)
    myobj.add(10)
    
    print(myobj.x)      # 15

    Objects are often modified using getter and setter methods

    These methods are used to read and write object attributes in a controlled way.


    class Counter(object):
        def setval(self, val):     # arguments are:  the instance, and the value to be set
            if not isinstance(val, int):
                raise TypeError('arg must be an int')
    
            self.value = val        # set the value in the instance's attribute
    
        def getval(self):          # only one argument:  the instance
            return self.value       # return the instance attribute value
    
        def increment(self):
            self.value = self.value + 1
    
    a = Counter()
    b = Counter()
    
    a.setval(10)       # although we pass one argument, the implied first argument is a itself
    
    a.increment()
    a.increment()
    
    print(a.getval())   # 12
    
    
    b.setval('hello')  # TypeError

    __init__() is automagically called when a new instance is created

    The initializer of an object allows us to set the initial attribute values of the object.


    class MyCounter(object):
      def __init__(self, initval):   # self is implied 1st argument (the instance)
        try:
          initval = int(initval)     # test initval to be an int,
        except ValueError:           # set to 0 if incorrect
          initval = 0
        self.value = initval         # initval was passed to the constructor
    
      def increment_val(self):
        self.value = self.value + 1
    
      def get_val(self):
        return self.value
    
    a = MyCounter(0)
    b = MyCounter(100)
    
    a.increment_val()
    a.increment_val()
    a.increment_val()
    
    b.increment_val()
    b.increment_val()
    
    print(a.get_val())    # 3
    print(b.get_val())    # 102

    Classes can be organized into an inheritance tree

    When a class inherits from another class, attribute lookups can pass to the parent class when accessed from the child.


    class Animal(object):
      def __init__(self, name):
        self.name = name
      def eat(self, food):
        print('{} eats {}'.format(self.name, food))
    
    class Dog(Animal):
      def fetch(self, thing):
        print('{} goes after the {}!'.format(self.name, thing))
    
    class Cat(Animal):
      def swatstring(self):
        print('{} shreds the string!'.format(self.name))
      def eat(self, food):
        if food in ['cat food', 'fish', 'chicken']:
          print('{} eats the {}'.format(self.name, food))
        else:
          print('{}:  snif - snif - snif - nah...'.format(self.name))
    
    d = Dog('Rover')
    c = Cat('Atilla')
    
    d.eat('wood')                 # Rover eats wood.
    c.eat('dog food')             # Atilla:  snif - snif - snif - nah...

    Conceptually similar methods can be unified through polymorphism

    Same-named methods in two different classes can share a conceptual similarity.


    class Animal(object):
      def __init__(self, name):
        self.name = name
      def eat(self, food):
        print('{} eats {}'.format(self.name, food))
    
    class Dog(Animal):
      def fetch(self, thing):
        print('{} goes after the {}!'.format(self.name, thing))
      def speak(self):
        print('{}:  Bark!  Bark!'.format(self.name))
    
    class Cat(Animal):
      def swatstring(self):
        print('{} shreds the string!'.format(self.name))
      def eat(self, food):
        if food in ['cat food', 'fish', 'chicken']:
          print('{} eats the {}'.format(self.name, food))
        else:
          print('{}:  snif - snif - snif - nah...'.format(self.name))
      def speak(self):
        print('{}:  Meow!'.format(self.name))
    
    for a in (Dog('Rover'), Dog('Fido'), Cat('Fluffy'), Cat('Precious'), Dog('Rex'), Cat('Kittypie')):
      a.speak()
    
                       # Rover:  Bark!  Bark!
                       # Fido:  Bark!  Bark!
                       # Fluffy:  Meow!
                       # Precious:  Meow!
                       # Rex:  Bark!  Bark!
                       # Kittypie:  Meow!

    Static Methods and Class Methods

    A class method can be called through the instance or the class, and is passed the class as its first argument. We use these methods to do class-wide work, such as counting instances or maintaining a table of variables available to all instances.

    A static method can be called through the instance or the class, but knows nothing about either. In this way it is like a regular function -- it takes no implicit argument. We can think of these as 'helper' functions that just do some utility work and don't need to involve either class or instance.


    class MyClass(object):
    
      def myfunc(self):
        print("myfunc:  arg is {}".format(self))
    
      @classmethod
      def myclassfunc(cls):        # conventionally named 'cls', since 'class' is a reserved word
        print("myclassfunc:  arg is {}".format(cls))
    
      @staticmethod
      def mystaticfunc():
        print("mystaticfunc: (no arg)")
    
    a = MyClass()
    
    a.myfunc()             # myfunc:  arg is <__main__.MyClass object at 0x...>
    
    MyClass.myclassfunc()  # myclassfunc:  arg is <class '__main__.MyClass'>
    a.myclassfunc()        # [ same ]
    
    a.mystaticfunc()       # mystaticfunc: (no arg)

    Here is an example from Learning Python, which counts instances that are constructed:


    class Spam:
    
      numInstances = 0
    
      def __init__(self):
        Spam.numInstances += 1
    
      @staticmethod
      def printNumInstances():
        print("instances created: ", Spam.numInstances)
    
    s1 = Spam()
    s2 = Spam()
    s3 = Spam()
    
    Spam.printNumInstances()        # instances created:  3
    s3.printNumInstances()          # instances created:  3
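    The same counter could also be written with a class method, which receives the class itself as its first argument and so avoids hard-coding the class name. A sketch (using a renamed Spam2 so the snippet stands alone):

```python
class Spam2:

    numInstances = 0

    def __init__(self):
        Spam2.numInstances += 1

    @classmethod
    def printNumInstances(cls):
        # cls is the class itself, so the class name isn't hard-coded
        print("instances created: ", cls.numInstances)

s1 = Spam2()
s2 = Spam2()

Spam2.printNumInstances()       # instances created:  2
s1.printNumInstances()          # instances created:  2
```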



    urllib

    Python as a web client: the urllib module

    A Python program can take the place of a browser, requesting and downloading CSV, HTML pages and other files. Your Python program can work like a web spider (for example visiting every page on a website looking for particular data or compiling data from the site), can visit a page repeatedly to see if it has changed, can visit a page once a day to compile information for that day, etc.


    urllib is a full-featured package for making web requests. Although the third-party requests module is strongly favored by many for its simplicity, it is not part of the Python standard library.


    The urlopen() function takes a URL and returns a file-like object that can be read() like a file:

    import urllib.request
    my_url = 'http://www.google.com'
    readobj = urllib.request.urlopen(my_url)  # return a 'file-like' object
    text = readobj.read()                     # read into a 'byte string'
    # text = text.decode('utf-8')             # optional, sometimes required:
                                              # decode as a 'str' (see below)
    readobj.close()

    Alternatively, you can call readlines() on the object (keep in mind that many file-like objects can be read with this same-named method):

    for line in readobj.readlines():
      print(line)
    readobj.close()

    POTENTIAL ERRORS AND REMEDIES WITH urllib


    TypeError mentioning 'bytes' -- sample exception messages:

    TypeError: can't use a string pattern on a bytes-like object
    TypeError: must be str, not bytes
    TypeError: can't concat bytes to str

    These errors indicate that you tried to use a byte string where a str is appropriate.


    The urlopen() response comes to us as a bytes object (a "byte string"). In order to work with the response as a str, we can use the decode() method, naming an encoding.

    text = text.decode('utf-8')

    'utf-8' is the most common encoding, although others ('ascii', 'latin-1', 'ISO-8859-1', 'utf-16', 'utf-32', and more) may be required.
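    A small offline sketch of the bytes/str distinction -- the literal below is a hypothetical stand-in for a urlopen() response body:

```python
raw = b'<title>caf\xc3\xa9</title>'   # bytes, as urlopen().read() returns
text = raw.decode('utf-8')            # now a str:  '<title>café</title>'

try:
    '<html>' + raw                    # mixing str and bytes...
except TypeError as err:
    print(err)                        # ...raises the TypeError described above
```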


    I have found that we do not always need to convert (depending on what you will be doing with the returned string), which is why I commented out the line in the first example.


    SSL Certificate Error

    Many websites enable SSL security and require a web request to accept and validate an SSL certificate (certifying the identity of the server). urllib requires SSL certificate verification by default, but it can be bypassed (keep in mind that this may be a security risk).


    import ssl
    import urllib.request
    
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    
    my_url = 'http://www.nytimes.com'
    readobj = urllib.request.urlopen(my_url, context=ctx)

    Encoding Parameters: urllib.parse.urlencode()

    When including parameters in our requests, we must encode them into our request URL. The urlencode() method does this nicely:


    import urllib.request, urllib.parse
    
    params = urllib.parse.urlencode({'choice1': 'spam and eggs', 'choice2': 'spam, spam, bacon and spam'})
    print("encoded query string: ", params)
    f = urllib.request.urlopen("http://www.google.com?{}".format(params))
    print(f.read())

    this prints (the echoed response lines assume a server that echoes its parameters back):

    encoded query string: choice1=spam+and+eggs&choice2=spam%2C+spam%2C+bacon+and+spam
    
    choice1:  spam and eggs<BR>
    choice2:  spam, spam, bacon and spam<BR>



    HTML and Bootstrap css

    HTML

    HTML pages are simply web pages that are displayed by a browser.


    An extremely basic HTML page

    <HTML>
      <HEAD>
        <TITLE>The Page of Greeting</TITLE>
      </HEAD>
      <BODY>
      Hello!
      </BODY>
    </HTML>

    "Static" HTML pages are plaintext files that are saved to a computer's disk. When requested through a web server, the server program simply finds the file and returns it to the browser. When requested through your local computer's filesystem, the browser simply reads the file off of the disk in the same way any file reader might. Once you've saved your HTML to disk you can easily display it in a browser using File > Open and browse to the HTML page.


    saved in a file called hello.txt:

    Hello, plaintext!



    Basic HTML Tags

    Here are the most common HTML tags


    <HTML>, <HEAD>, <BODY>          overall structure
    <DIV>, <SPAN>                   layout section tags
    <LINK>, <META>                  meta / summary information about page
    <H1>, <H2>, <H3>, <H4>, <H5>    "headings" displayed as enlarged titles
    <P>                             paragraph (text set off from surrounding text)
    <B>, <U>, <I>                   bold, underline, italic
    <PRE>                           "preformatted" -- displayed in monospace with spaces preserved
    <UL>, <OL>, <LI>                bulleted list, numbered list, list item
    <BR>                            line break (drop down one line)
    <IMG SRC="">                    embed an image
    <A HREF="">                     embed a hyperlink
    <TABLE>, <TR>, <TD>, <TH>       table, row, cell, heading cell
    <FORM>, <INPUT>                 input form


    greeting.html:

    <HTML>
      <HEAD>
        <TITLE>The Page of Greeting</TITLE>
      </HEAD>
      <BODY>
        <H1>Greetings!</H1>
        You've reached my page.
        <BR><BR><BR>
    
        Look at this IMAGE:<BR>
        <IMG SRC="../python_data/happup.jpg">
        <BR><BR>
    
        Visit this LINK:  <A HREF="https://www.yahoo.com/">yahoo!</A>
        <BR><BR><BR>
    
        Check out these PREFORMATTED FIGURES:<BR>
        <PRE>
               23.9      22.8     1117.0
                1.8      17.0       55.0
        </PRE>
        <BR><BR>
    
        Here is a TABLE:<BR>
        <TABLE border="1" width="75%" cellpadding="20" align="center">
          <TR>
            <TH>Date</TH><TH>City</TH><TH>Avg. Temp (F)</TH>
          </TR>
          <TR>
            <TD>2017-03-09</TD><TD>Hamburg</TD><TD align="right">63</TD>
          </TR>
          <TR>
            <TD>2017-03-10</TD><TD>Paree</TD><TD align="right">61</TD>
          </TR>
        </TABLE>
        <BR><BR><BR>
    
    
        Here is a FORM!<BR><BR>
        <FORM action="http://localhost:5000/">
      <INPUT NAME="id" TYPE="hidden">             <!-- submits a non-visible value -->
          <B>What is your NAME?</B><BR>
          <INPUT NAME="name" size="50"><BR><BR>
          <B>What is your QUEST?</B><BR>
          <INPUT NAME="quest" size="50"><BR><BR>
          <B>What is your FAVORITE COLOR?</B><BR>
          <INPUT NAME="color" size="50"><BR><BR>
          <INPUT TYPE="submit" VALUE="submit answer!">
        </FORM>
    
      </BODY>
    </HTML>

    HTML tags: name, attributes, text

    Each of these pieces may be extracted from a tag.


    All HTML tags have names.

    <H1>Welcome to this Page</H1>                # An H1 tag
    
    <A HREF="mysite.html">click here</A>         # An A tag

    Many HTML tags have attributes.

    <IMG SRC="puppy.jpg">

    Most tags are "container" tags which mark content -- the text and/or tags found between the open and close tags. Tags with no close tag are called "empty" tags.

    This is <B>important</B>.                    # "container"
    <meta name="viewport" content="alpha" />     # "empty" tag

    CSS and Bootstrap

    CSS is a styling and formatting language for web pages; Bootstrap is a framework for easy CSS styling.


    Cascading Style Sheets provide a language read by web pages that specifies font, color, line, layout, etc., of HTML page elements. The spec can be used to style any web page. However, it can be time consuming to learn the many styling options and to keep them consistent within a website.

    Bootstrap was originally developed at Twitter and is now the go-to library for web page styling, especially suited for those who prefer not to have to master CSS. Bootstrap provides css for a wide range of HTML element stylings; it also provides javascript for "active" element effects like rollovers.

    Downloading Bootstrap

    Bootstrap is a set of files in folders. It can be downloaded as a .zip file. Download Bootstrap (find the download button under "Compiled CSS and JS" on this page: http://getbootstrap.com/docs/4.0/getting-started/download/)


    Adding Bootstrap to a Web page

    This process requires three steps:

  • placing the bootstrap folders on your system
  • creating a link to bootstrap files in your page
  • adding bootstrap <div> tags to elements in your page


    A common location for Bootstrap css and javascript is in a directory called static inside the directory where your web pages are located.

    htdocs\
       |---test.html        # a web page
       |---static\          # the 'static' directory
              |-----css\    # bootstrap 'css' directory
              |-----js\     # bootstrap 'js' directory

    Creating a link to bootstrap folders


    The CSS files for your page as well as the Javascript for effects are referenced in the page's <HEAD> tag. The URL for these files should be relative to the page location (in this case, relative to htdocs).

    <HEAD>
          <link href="static/css/bootstrap.min.css" rel="stylesheet">
          <script src="static/js/bootstrap.min.js"></script>
    </HEAD>

    Note that the path static/css is relative to the directory that test.html is in (htdocs/). If the static folder were in the parent directory, you would enter ../static/css/bootstrap.min.css for this value instead.


    Adding Bootstrap <div> tags to your HTML


    The <div class="container"> tag creates a formatting container which puts spacing around your page elements and also responds to varying display form factors (such as a smart phone)

    <BODY>
      <div class="container">
        <H1>Hello!</H1>
      </div>
    </BODY>

    w3 Schools Bootstrap Examples

    The w3 Schools (World Wide Web Schools) tutorials are a great place to learn Bootstrap options.


    The easiest way to experiment with Bootstrap is to take a look at examples and add them to your own pages to see the result.

    Core Examples      https://www.w3schools.com/bootstrap/
    List Groups        https://www.w3schools.com/bootstrap/bootstrap_list_groups.asp
    Tables             https://www.w3schools.com/bootstrap/bootstrap_tables.asp
    Forms              https://www.w3schools.com/bootstrap/bootstrap_forms.asp
    Inputs             https://www.w3schools.com/bootstrap/bootstrap_forms_inputs.asp
    Alerts             https://www.w3schools.com/bootstrap/bootstrap_alerts.asp
    Images             https://www.w3schools.com/bootstrap/bootstrap_images.asp
    The Well           https://www.w3schools.com/bootstrap/bootstrap_wells.asp
    Buttons            https://www.w3schools.com/bootstrap/bootstrap_buttons.asp
    Panels             https://www.w3schools.com/bootstrap/bootstrap_panels.asp
    The Grid System    https://www.w3schools.com/bootstrap/bootstrap_grid_basic.asp
    Cookbook           http://getbootstrap.com/docs/4.0/examples/




    Matplotlib -- a Brief Introduction

    Matplotlib for visualizations

    matplotlib.pyplot is the plotting module; the figure is saved within pyplot


    * pylab was the original module meant to emulate Matlab functionality.
    * Matlab is a proprietary software product that provides a numeric computing environment.
    * matplotlib is the Python package of modules that now encompasses pylab and its emulation of Matlab's functionality.

    A clear tutorial on the central plotting module pyplot (part of which was used for this presentation) can be found here: https://matplotlib.org/users/pyplot_tutorial.html


    matplotlib.pyplot

    The matplotlib.pyplot module is our interface to plotting


    # plot a line from 0 to 10
    import matplotlib.pyplot as plt
    
    x = list(range(10))
    
    print(x)                   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    plt.plot(x)                # plot a line graph with above values along a 10-point horizontal axis
    plt.ylabel('some values')
    plt.xlabel('integer index')
    
    plt.savefig('myplot.png')

    * Note that in import matplotlib.pyplot as plt, plt represents the plotting library, and much of what we do with our plots will take place through this module.
    * The calls to plt.plot(), plt.clf(), plt.savefig(), etc. show that the figure -- the object that represents the "current" plot -- "resides" in the module and not in an external variable. Individual figures can be returned and assigned to variables, but we often work with the default.




    Python Data Model

    Python's Data Model: Overview

    The Data Model specifies how objects, attributes, methods, etc. function and interact in the processing of data.


    The Python Language Reference provides a clear introduction to Python's lexical analyzer, data model, execution model, and various statement types. This session covers the basics of Python's data model. Mastery of these concepts allows you to create objects that behave like standard Python objects (i.e., using the same interface -- operators, looping, subscripting, etc.), and helps you become conversant on StackOverflow and other discussion sites.


    Special / "Private" / "Magic" Attributes

    All objects contain "private" attributes that may be methods that are indirectly called, or internal "meta" information for the object.


    The __dict__ attribute shows any attributes stored in the object.

    >>> list.__dict__.keys()
    dict_keys(['__new__', '__repr__', '__hash__', '__getattribute__', '__lt__',
     '__le__', '__eq__', '__ne__', '__gt__', '__ge__', '__iter__', '__init__',
     '__len__', '__getitem__', '__setitem__', '__delitem__', '__add__', '__mul__',
     '__rmul__', '__contains__', '__iadd__', '__imul__', '__reversed__',
     '__sizeof__', 'clear', 'copy', 'append', 'insert', 'extend', 'pop',
     'remove', 'index', 'count', 'reverse', 'sort', '__doc__'])

    The dir() function will show the object's available attributes, including those available through inheritance.

    >>> dir(list)
    ['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__',
     '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__',
     '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__',
     '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__',
     '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__',
     '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__',
     'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop',
     'remove', 'reverse', 'sort']

    In this case, dir(list) includes attributes not found in list.__dict__. What class(es) does list inherit from? We can use __bases__ to see:

    >>> list.__bases__
    (<class 'object'>,)

    This is a tuple of classes from which list inherits -- in this case, just the base class object.


    >>> object.__dict__.keys()
    dict_keys(['__new__', '__repr__', '__hash__', '__str__', '__getattribute__',
     '__setattr__', '__delattr__', '__lt__', '__le__', '__eq__', '__ne__',
     '__gt__', '__ge__', '__init__', '__reduce_ex__', '__reduce__',
     '__subclasshook__', '__format__', '__sizeof__', '__dir__', '__class__',
     '__doc__'])

    Of course this means that any object that inherits from object will have the above attributes -- and in Python 3, all classes inherit from object. (Note that the term "private" in this context does not refer to truly unreachable data as it would in C++ or Java.)
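    We can confirm this with a user-defined class, which defines almost nothing itself yet inherits the object attributes:

```python
class Plain(object):
    pass

print('__repr__' in Plain.__dict__)    # False -- not defined by Plain itself
print('__repr__' in dir(Plain))        # True  -- inherited from object
print(Plain.__bases__)                 # (<class 'object'>,)
```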


    Object Inspection And Modification Built-in Functions

    Object Inspection

    isinstance()    Checks to see if this object is an instance of a class (or parent class)
    issubclass()    Checks to see if this class is a subclass of another
    callable()      Checks to see if this object is callable
    hasattr()       Checks to see if this object has an attribute of this name

    Object Attribute Modification

    setattr()    sets an attribute in an object (using a string name)
    getattr()    retrieves an attribute from an object (using a string name)
    delattr()    deletes an attribute from an object (using a string name)
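    These built-ins take attribute names as strings, which lets us choose the attribute at runtime. A minimal sketch with a throwaway Bag class:

```python
class Bag(object):
    pass

b = Bag()

setattr(b, 'color', 'red')         # same as:  b.color = 'red'
print(getattr(b, 'color'))         # red
print(hasattr(b, 'color'))         # True

delattr(b, 'color')                # same as:  del b.color
print(hasattr(b, 'color'))         # False

# getattr() also accepts a default for missing attributes
print(getattr(b, 'size', 'n/a'))   # n/a
```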


    Special Attributes: "operator overloading"

    Some special attributes are methods, usually called implicitly as the result of function calls, the use of operators, subscripting or slicing, etc.


    We can replace any operator and many functions with the corresponding "magic" methods to achieve the same result:

    var = 'hello'
    var2 = 'world'
    
    print(var + var2)         # helloworld
    print(var.__add__(var2))  # helloworld
    
    print(len(var))           # 5
    print(var.__len__())      # 5
    
    if 'll' in var:
        print('yes')
    
    if var.__contains__('ll'):
        print('yes')

    Here is an example of a new class, Number, that reproduces the behavior of a number in that you can add, subtract, multiply, divide them with other numbers.

    class Number(object):
      def __init__(self, start):
        self.data = start
      def __sub__(self, other):
        return Number(self.data - other)
      def __add__(self, other):
        return Number(self.data + other)
      def __mul__(self, other):
        return Number(self.data * other)
      def __truediv__(self, other):          # named __div__ in Python 2
        return Number(self.data / other)     # / is already "true" division in Python 3
      def __repr__(self):
        print("Number value: ", end=' ')
        return str(self.data)
    
    X = Number(5)
    X = X - 2
    print(X)               # Number value: 3

    Of course this means that existing built-in objects make use of these methods -- you can find them listed from the object's dir() listing.


    Special Attributes: Reimplementing __repr__ and __str__

    __str__ is invoked when we print an object or convert it with str(); __repr__ is used when __str__ is not available, or when we view an object at the Python interpreter prompt.


    class Number(object):
      def __init__(self, start):
        self.data = start
      def __str__(self):
        return str(self.data)
      def __repr__(self):
        return 'Number(%s)' % self.data
    
    X = Number(5)
    print(X)          # 5  (uses __str__; without __str__ or __repr__, would print <__main__.Number object at 0x105d61190>)

    __str__ is intended to display a human-readable version of the object; __repr__ is supposed to show a more "machine-faithful" representation.
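    The built-in datetime.date class shows the intended contrast between the two:

```python
import datetime

d = datetime.date(2017, 3, 9)

print(str(d))      # 2017-03-09                   (human-readable, via __str__)
print(repr(d))     # datetime.date(2017, 3, 9)    (machine-faithful, via __repr__)
```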


    Special attributes available in class design

    Here is a short listing of attributes available in many of our standard objects.


    You can see many of these methods as part of the attribute dictionary through dir(). There is also a more exhaustive list with explanations provided by Rafe Kettler.

    object construction and destruction:

    __init__    object constructor
    __del__     del x (invoked when reference count goes to 0)
    __new__     special 'metaclass' constructor

    object rendering:

    __repr__    "under the hood" representation of object (in Python interpreter)
    __str__     string representation (i.e., when printed or converted with str())

    object comparisons:

    __lt__      <
    __le__      <=
    __eq__      ==
    __ne__      !=
    __gt__      >
    __ge__      >=
    __bool__    bool(), i.e. when used in a boolean test (named __nonzero__ in Python 2)
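    Reimplementing __eq__ and __lt__ is enough to make a class comparable and sortable; the functools.total_ordering decorator can derive the remaining comparisons from those two. A minimal sketch with a hypothetical Version class:

```python
import functools

@functools.total_ordering
class Version(object):
    def __init__(self, major, minor):
        self.major = major
        self.minor = minor
    def __eq__(self, other):
        return (self.major, self.minor) == (other.major, other.minor)
    def __lt__(self, other):
        return (self.major, self.minor) < (other.major, other.minor)

print(Version(1, 2) < Version(1, 10))    # True
print(Version(2, 0) >= Version(1, 9))    # True -- derived by total_ordering
```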

    calling object as a function:

    __call__    when object is "called" (i.e., with ())

    container operations:

    __len__         handles len() function
    __getitem__     subscript access (i.e. mylist[0] or mydict['mykey'])
    __missing__     handles missing keys
    __setitem__     handles dict[key] = value
    __delitem__     handles del dict[key]
    __iter__        handles looping
    __reversed__    handles reversed() function
    __contains__    handles 'in' operator

    (__getslice__, __setslice__, and __delslice__ existed only in Python 2; in Python 3,
    slicing passes a slice object to __getitem__, __setitem__, and __delitem__.)
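    Implementing just a couple of these container methods makes a class work with len(), subscripts, looping, and 'in'. A minimal sketch with a throwaway Shelf class:

```python
class Shelf(object):
    def __init__(self, items):
        self._items = list(items)
    def __len__(self):
        return len(self._items)
    def __getitem__(self, index):
        return self._items[index]      # raises IndexError past the end

s = Shelf(['a', 'b', 'c'])

print(len(s))        # 3
print(s[0])          # a
print('b' in s)      # True -- 'in' and for loops fall back on __getitem__
print(list(s))       # ['a', 'b', 'c']
```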

    attribute access (discussed in upcoming session):

    __getattr__         object.attr read: attribute may not exist
    __getattribute__    object.attr read: attribute that already exists
    __setattr__         object.attr write
    __delattr__         object.attr deletion (i.e., del this.that)

    'descriptor' class methods (discussed in upcoming session)

    __get__       when an attribute w/descriptor is read
    __set__       when an attribute w/descriptor is written
    __delete__    when an attribute w/descriptor is deleted with del

    numeric types:

    __add__         addition with +
    __sub__         subtraction with -
    __mul__         multiplication with *
    __truediv__     division with /  (named __div__ in Python 2)
    __floordiv__    "floor division", i.e. with //
    __mod__         modulus with %


    "Introspection" Special Attributes

    The name, module, file, arguments, documentation, and other "meta" information for an object can be found in special attributes.


    Below is a partial listing of special attributes; available attributes are discussed in more detail on the data model documentation page.

    user-defined functions

    __doc__         doc string
    __name__        this function's name
    __module__      module in which this func is defined
    __defaults__    default arguments
    __code__        the "compiled function body" of bytecode of this function. Code objects can be inspected with the inspect module and "disassembled" with the dis module.
    __globals__     global variables available from this function
    __dict__        attributes set in this function object by the user

    user-defined methods

    __func__      the underlying function object
    __self__      instance object the method is bound to
    __module__    name of the module

    modules

    __dict__    globals in this module
    __name__    name of this module
    __doc__     docstring
    __file__    file this module is defined in

    classes

    __name__      class name
    __module__    module defined in
    __bases__     classes this class inherits from
    __doc__       docstring

    class instances (objects)

    __class__    class of this instance
    __dict__     attributes set on this instance
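    A short sketch of these introspection attributes in action:

```python
def greet():
    """Return a greeting."""
    return 'hello'

print(greet.__name__)      # greet
print(greet.__doc__)       # Return a greeting.

class Thing(object):
    pass

print(Thing.__name__)      # Thing
print(Thing.__bases__)     # (<class 'object'>,)

t = Thing()
print(t.__class__)         # <class '__main__.Thing'>
```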


    Variable Naming Conventions

    Underscores are used to designate variables as "private" or "special".


    lower-case separated by underscores (my_nice_var)             "public", intended to be exposed to users of the module and/or class
    underscore before the name (_my_private_var)                  "non-public", *not* intended for importers to access (additionally, "from modulename import *" doesn't import these names)
    double underscore before the name (__dont_inherit)            "private"; its name is "mangled", available only as _classname__dont_inherit
    double underscores before and after the name (__magic_me__)   "magic" attribute or method, specific to Python's internal workings
    single underscore after the name (list_)                      used to avoid shadowing built-in names (such as the list type)


    class GetSet(object):
    
        instance_count = 0
    
        __mangled_name = 'no privacy!'
    
        def __init__(self, value):
            self._attrval = value
            GetSet.instance_count += 1       # must qualify with the class name
    
        def getvar(self):
            print('getting the "var" attribute')
            return self._attrval
    
        def setvar(self, value):
            print('setting the "var" attribute')
            self._attrval = value
    
    cc = GetSet(5)
    cc.setvar(10)
    print(cc.getvar())                 # 10
    print(cc.instance_count)           # 1
    
    print(cc._attrval)                 # "private", but available:  10
    print(cc.__mangled_name)           # "private", apparently not available...
    print(cc._GetSet__mangled_name)    # ...and yet, accessible through "mangled" name
    
    cc.__newmagic__ = 10              # MAGICS ARE RESERVED BY PYTHON -- DON'T DO THIS

    Subclassing Builtin Objects

    Inheriting from a class (the base or parent class) makes all methods and attributes available to the inheriting class (the child class).


    class NewList(list):     # an empty class - does nothing but inherit from list
        pass
    
    x = NewList([1, 2, 3, 'a', 'b'])
    x.append('HEEYY')
    
    print(x[0])   # 1
    print(x[-1])  # 'HEEYY'

    Overriding Base Class Methods


    This class automatically returns a default value if a key can't be found -- it traps and works around the KeyError that would normally result.

    class DefaultDict(dict):
    
        def __init__(self, default=None):
            dict.__init__(self)
            self.default = default
    
        def __getitem__(self, key):
            try:
                return dict.__getitem__(self, key)
            except KeyError:
                return self.default
        def get(self, key, userdefault=None):
            if userdefault is None:
                userdefault = self.default
            return dict.get(self, key, userdefault)
    
    xx = DefaultDict()
    
    xx['c'] = 5
    
    print(xx['c'])          # 5
    print(xx['a'])          # None

    Since the other dict methods related to dict operations (__setitem__, extend(), keys(), etc.) are present in the dict class, any calls to them also work because of inheritance.


    WARNING! Avoiding method recursion. The parent-class calls in DefaultDict above (as well as in MyList below) exist to avoid infinite recursion. If we were to call self[key] from inside DefaultDict.__getitem__(), Python would again call DefaultDict.__getitem__() in response, and an infinite loop of calls would result. We call this infinite recursion.


    The same is true for MyList.__getitem__() and MyList.__setitem__() below.

        # from DefaultDict.get()
        dict.get(self, key, userdefault)       # why not self.get(key, userdefault)?
    
        # from MyList.__getitem__()
        return list.__getitem__(self, index)   # why not self[index]?
    
        # from MyList.__setitem__()                   # (from example below)
        list.__setitem__(self, index, value)   # why not self[index] = value?

    Another example -- a custom list that indexes items starting at 1:

    class MyList(list):         # inherit from list
      def __getitem__(self, index):
        if index == 0:  raise IndexError
        if index > 0: index = index - 1
        return list.__getitem__(self, index)  # this method is called when we access
                                                     # a value with subscript (x[1], etc.)
      def __setitem__(self, index, value):
        if index == 0:  raise IndexError
        if index > 0: index = index - 1
        list.__setitem__(self, index, value)
    
    x = MyList(['a', 'b', 'c'])  # __init__() inherited from builtin list
    
    print(x)                      # __repr__() inherited from builtin list
    
    x.append('spam')             # append() inherited from builtin list
    
    print(x[1])                   # 'a' (MyList.__getitem__
                                 #      customizes list superclass method)
                                 # index should be 0 but it is 1!
    
    print(x[4])                   # 'spam' (index should be 3 but it is 4!)

    So MyList acts like a list in most respects, but its index starts at 1 instead of 0 (at least where subscripting is concerned -- other list methods would have to be overridden to complete this 1-indexing behavior).


    Iterator Protocol

    The protocol specifies methods to be implemented to make our objects iterable.


    "Iterable" simply means able to be looped over or otherwise treated as a sequence or collection. The for loop is the most obvious feature that iterates, however a great number of functions and other features perform iteration, including list comprehensions, max(), min(), sorted(), map(), filter(), etc., because each of these must consider every item in the collection.


    We can make our own objects iterable by implementing __iter__() and __next__(), and by raising the StopIteration exception when items are exhausted

    class Counter:
        def __init__(self, low, high):
            self.current = low
            self.high = high
    
        def __iter__(self):
            return self
    
        def __next__(self):
            if self.current > self.high:
                raise StopIteration
            else:
                self.current += 1
                return self.current - 1
    
    
    for c in Counter(3, 8):
        print(c)
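    Because Counter implements the iterator protocol, every iterating feature can consume it -- not only for. A quick sketch re-using the Counter class from above:

```python
class Counter:
    def __init__(self, low, high):
        self.current = low
        self.high = high

    def __iter__(self):
        return self

    def __next__(self):
        if self.current > self.high:
            raise StopIteration
        self.current += 1
        return self.current - 1

print(list(Counter(3, 8)))    # [3, 4, 5, 6, 7, 8]
print(sum(Counter(3, 8)))     # 33
print(max(Counter(3, 8)))     # 8
```

    Note that each call constructs a fresh Counter: because __iter__() returns self, a single instance is exhausted after one pass.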

    Reading a file with 'with'

    A file is automatically closed upon exiting the 'with' block


    A 'best practice' is to open files using a 'with' block. When execution leaves the block, the file is automatically closed.

    with open('myfile.txt') as fh:
        for line in fh:
            print(line)
    
    ## at this point (outside the with block), filehandle fh has been closed.

    The conventional approach:

    fh = open('myfile.txt')
    for line in fh:
        print(line)
    
    fh.close()        # explicit close() of the file

    Although open files do not normally block other processes from opening the same file, each open file does consume a file descriptor -- a per-process operating system resource available in limited supply. If many files are left open (especially in long-running processes), the process can exhaust its file descriptor limit. Therefore, files should be closed as soon as possible.


    Implementing a 'with' context

    Any object definition can include a 'with' context; what the object does when leaving the block is determined in its design.


    A 'with' context is implemented using the magic methods __enter__() and __exit__().

    class CustomWith:
        def __init__(self):
            """ when object is created """
            print('new object')
    
        def __enter__(self):
            """ when 'with' block begins (normally same time as __init__()) """
            print('entering "with"')
            return self
    
        def __exit__(self, exc_type, exc_value, exc_traceback):
            """ when 'with' block is left """
            print('leaving "with"')
    
            # if an exception should occur inside the with block:
            if exc_type:
                print('oops, an exception')
                raise exc_type(exc_value)     # raising same exception (optional)
    
    with CustomWith() as fh:
        print('ok')
    
    print('done')

    __enter__() is called automatically when Python enters the with block; this is usually also when the object is created with __init__(), although it is possible to create the object earlier and enter the with block later. __exit__() is called automatically when Python exits the with block. If an exception occurs inside the with block, Python passes __exit__() the exception type, the value passed to the exception (usually a string error message) and a traceback object ("Traceback (most recent call last):..."). In our above program, if an exception occurred (if exc_type has a value) we are choosing to re-raise the same exception. Your program can choose any action at that point.
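    For simpler cases, the standard library offers contextlib.contextmanager, which builds the __enter__()/__exit__() pair from a single generator function. This sketch is an alternative to the class above, not part of the slide's example:

```python
from contextlib import contextmanager

@contextmanager
def custom_with():
    print('entering "with"')      # runs like __enter__()
    try:
        yield 'some resource'     # the yielded value is bound by 'as'
    finally:
        print('leaving "with"')   # runs like __exit__(), even on exception

with custom_with() as res:
    print('ok,', res)
```

    Code before the yield plays the role of __enter__(); the finally clause plays the role of __exit__().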


    Internal Types

    Some implicit objects can provide information on code execution.


    Traceback objects


    Traceback objects become available during an exception. Here's an example of inspection of the exception type using sys.exc_info()

    import sys, traceback
    try:
        some_code_i_wrote()
    except BaseException as e:
        error_type, error_string, error_tb =  sys.exc_info()
        if error_type is not SystemExit:
            print('error type:    {}'.format(error_type))
            print('error string:  {}'.format(error_string))
            print('traceback:     {}'.format(''.join(traceback.format_exception(error_type, e, error_tb))))

    Code objects

    In CPython (the most common distribution), a code object is a piece of compiled bytecode. It is possible to query this object and examine its attributes in order to learn about bytecode execution.

    Frame objects

    A frame object represents an execution frame (a new frame is entered each time a function is called). Frame objects can be found in traceback objects (which trace frames during execution). Useful frame attributes:

    f_back       previous stack frame
    f_code       code object executed in this frame
    f_locals     local variable dictionary
    f_globals    global variable dictionary
    f_builtins   built-in variable dictionary


    For example, this line placed within a function prints the function name, which can be useful for debugging -- here we pull the current frame, grab that frame's code object, and read its co_name attribute.

    import sys
    
    def myfunc():
        print('entering {}()'.format(sys._getframe().f_code.co_name ))

    Calling this function, the frame object's function name is printed:

    myfunc()         # entering myfunc()
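    The same attributes let a function discover its caller: sys._getframe(1) is the calling frame (like sys._getframe() itself, this is CPython-specific). A small sketch with invented function names:

```python
import sys

def report():
    frame = sys._getframe(1)                 # the caller's frame
    print('called from {}()'.format(frame.f_code.co_name))
    return frame.f_code.co_name

def myfunc():
    return report()

caller = myfunc()        # called from myfunc()
```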



    Attribute Access in Classes and Instances

    getattr() and setattr()

    These built-in functions allow attribute access through string arguments.


    class This(object):       # a simple class with one class variable
        a = 5
    
    x = This()
    
    print(getattr(x, 'a'))     # 5:  finds the 'a' attribute in the class
    
    setattr(x, 'b', 10)       #     set x.b = 10 in the instance
    
    print(x.b)                 # 10: retrieve x.b from the instance

    Similar to the dict method get(), a default argument can be passed to getattr() to return a value if the attribute doesn't exist (otherwise, a missing attribute will raise an AttributeError exception).

    class InstMake(object):        # create a featureless class so we can
                                   # play with its instance
        pass
    
    x = InstMake()
    
    curr_val = getattr(x, 'intval', 0)   # no 'intval' attribute, so default 0
    setattr(x, 'intval', 10)             # set 'intval' to 10
    
    print(x.__dict__)                     # {'intval': 10}

    We might want to use these functions as a dispatch utility: if our program is working with a string value that is the name of an attribute, we can use the string directly.

    var = 'hElLo'
    for methodname in ('upper', 'lower', 'title'):
        print(getattr(var, methodname)())            # call each method through the attribute
                                                    # HELLO
                                                    # hello
                                                    # Hello

    Special methods __getattr__, __getattribute__ and __setattr__

    These special methods are called when we access an attribute (setting or getting). We can implement them for broad control over our custom class' attributes.


    class MyClass(object):
    
        classval = 5
    
        # when read attribute does not exist
        def __getattr__(self, name):
            default = 0
            print('getattr: "{}" not found; setting default'.format(name))
            setattr(self, name, default)
            return default
    
        # when attribute is read
        def __getattribute__(self, name):
            print('getattribute:  attempting to access "{}"'.format(name))
            return object.__getattribute__(self, name)
    
        # when attribute is assigned
        def __setattr__(self, name, value):
            print('setattr:  setting "{}" to value "{}"'.format(name, value))
            self.__dict__[name] = value
    
    x = MyClass()
    
    x.a = 5             # setattr:  setting "a" to value "5"
    
    print(x.a)           # getattribute:  attempting to access "__dict__"
                        # getattribute:  attempting to access "a"
                        # 5
    
    print(x.ccc)         # getattribute:  attempting to access "ccc"
                        # getattr: "ccc" not found; setting default
                        # setattr:  setting "ccc" to value "0"
                        # getattribute:  attempting to access "__dict__"
                        # 0

    __getattribute__: implicit call upon attribute read. Anytime we attempt to access an attribute, Python calls this method if it is implemented in the class.

    __getattr__: implicit call for a non-existent attribute. If an attribute does not exist, Python calls this method -- regardless of whether it called __getattribute__.

    recursion alert: inside these methods we must use alternate means of getting or setting attributes, lest Python call the methods repeatedly:

        __getattribute__():  use object.__getattribute__(self, name)
        __setattr__():       use self.__dict__[name] = value

    e.g., use of self.attr = val inside __setattr__ would cause the method to call itself.
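    An equivalent, recursion-safe way to write __setattr__() is to delegate to object.__setattr__(), mirroring the object.__getattribute__() call shown above; a minimal sketch:

```python
class Logged:
    def __setattr__(self, name, value):
        print('setting {} = {}'.format(name, value))
        object.__setattr__(self, name, value)   # safe: does not re-enter __setattr__

x = Logged()
x.a = 5          # setting a = 5
print(x.a)       # 5
```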


    @property: attribute control

    This decorator allows behavior control when an individual attribute is accessed, through separate @property, @setter and @deleter methods.


    class GetSet(object):
    
        def __init__(self,value):
            self.attrval = value
    
        @property
        def var(self):
            print('thanks for calling me -- returning {}'.format(self.attrval))
            return self.attrval
    
        @var.setter
        def var(self, value):
            print("thanks for setting me -- setting 'var' to {}".format(value))
            self.attrval = value
    
        @var.deleter
        def var(self):
            print('should I thank you for deleting me?')
            self.attrval = None
    
    me = GetSet(5)
    
    me.var = 1000    # setting the "var" attribute
    
    print(me.var)     # thanks for calling me -- returning 1000
                     # 1000
    
    del me.var       # should I thank you for deleting me?
    print(me.var)     # thanks for calling me -- returning None
                     # None
    

    Note that each decorated method is named var. This would cause naming conflicts if it weren't for the decorators.


    One caveat: since the interface for attribute access appears very simple, it can be misleading to attach computationally expensive operations to an attribute decorated with @property.
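    A common remedy is to compute the value once and cache it, so repeated attribute reads stay cheap. A sketch of the pattern (the class and attribute names here are invented for illustration):

```python
class Report:
    def __init__(self, data):
        self._data = data
        self._total = None            # cache slot

    @property
    def total(self):
        if self._total is None:       # compute only on first access
            print('computing...')
            self._total = sum(self._data)
        return self._total

r = Report([1, 2, 3])
print(r.total)    # computing...  then 6
print(r.total)    # 6 (cached; no recomputation)
```

    Python 3.8+ packages this pattern as functools.cached_property.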


    Implementing Decorators

    The ability of a function to accept other functions as arguments, or to return functions as return values, are at the heart of the @decorator scheme.


    Python decorators are functions that modify the behavior of other functions. They can be added to any function through the use of the @ sign and the decorator name on the line above the function.


    Here's a simple example adapted from the Jeff Knupp blog:

    def currency(f):                              # decorator function
        def wrapper(*args, **kwargs):
            return '$' + str(f(*args, **kwargs))
        return wrapper
    
    @currency
    def price_with_tax(price, tax_rate_percentage):
        """Return the price with *tax_rate_percentage* applied.
        *tax_rate_percentage* is the tax rate expressed as a float, like
        "7.0" for a 7% tax rate."""
    
        return price * (1 + (tax_rate_percentage * .01))
    
    print(price_with_tax(50, .10))           # $50.05
    
    # 'manual' version of the above -- note that price_with_tax is already
    # decorated, so currency() here wraps it a second time (hence the two '$')
    pwt = currency(price_with_tax)
    print(pwt(50, .10))                      # $$50.05
    

    In this example, *args and **kwargs represent "unlimited positional arguments" and "unlimited keyword arguments". This is done to allow the flexibility to decorate any function (as it will match any function argument signature).
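    One wrinkle with wrapper functions: the decorated function's __name__ and docstring are replaced by the wrapper's. The standard library's functools.wraps repairs this; a sketch reusing the structure of the currency example:

```python
import functools

def currency(f):
    @functools.wraps(f)               # copy f's __name__, __doc__, etc. onto wrapper
    def wrapper(*args, **kwargs):
        return '$' + str(f(*args, **kwargs))
    return wrapper

@currency
def price_with_tax(price, tax_rate_percentage):
    """Return the price with tax applied."""
    return price * (1 + tax_rate_percentage * .01)

print(price_with_tax.__name__)     # price_with_tax (not 'wrapper')
print(price_with_tax(200, 50.0))   # $300.0
```

    Without functools.wraps, price_with_tax.__name__ would report 'wrapper', which confuses debugging and documentation tools.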


    You may have realized that @classmethod and @staticmethod are decorators. This means that there is a built-in object called classmethod and one called staticmethod, and each is responsible for handling the arguments to any method it decorates -- principally to allow calls that do not pass an instance as the first argument, which is the default behavior when methods are called on objects.

    class DoThis(object):
    
        def show_instance(self):   # default 'instance' method:  instance passed
            print(self)             # <__main__.DoThis object at 0x1004df590>
    
        @classmethod
        def show_class(cls):       # decorated 'class' method:  class is passed
            print(cls)              # <class '__main__.DoThis'>
    
        @staticmethod
        def say_hello():           # decorated 'static' method:  no implicit argument
            print('hello!')         # hello!
    
    x = DoThis()
    x.show_instance()              # standard instance method:  object is implicitly passed
    x.show_class()                 # class method:  class is implicitly passed
    x.say_hello()                  # static method:  no arg implicitly passed
    
    

    The benefit here is that, rather than requiring the user to explicitly pass a function to a processing function, we can simply decorate each function to be processed and it will behave as advertised.


    Attribute access: descriptors

    A descriptor is an attribute that is linked to a separate class that defines __get__(), __set__() or __delete__().


    class RevealAccess(object):
        """ A data descriptor that sets and returns values and prints a message declaring access. """
    
        def __init__(self, initval=None):
            self.val = initval
    
        def __get__(self, obj, objtype):
            print('Getting attribute from object', obj)
            print('...and doing some related operation that should take place at this time')
            return self.val
    
        def __set__(self, obj, val):
            print('Setting attribute from object', obj)
            print('...and doing some related operation that should take place at this time')
            self.val = val
    
    
    # the class we will work with directly
    class MyClass(object):
        """ A simple class with a class variable as descriptor """
        def __init__(self):
            print('initializing object ', self)
    
        x = RevealAccess(initval=0)  # attach a descriptor to class attribute 'x'
    
    
    mm = MyClass()                   # initializing object  <__main__.MyClass object at 0x10066f7d0>
    
    mm.x = 5                         # Setting attribute from object <__main__.MyClass object at 0x1004de910>
                                     # ...and doing some related operation that should take place at this time
    
    val = mm.x                       # Getting attribute from object <__main__.MyClass object at 0x1004de910>
                                     # ...and doing some related operation that should take place at this time
    
    print('retrieved value: ', val)   # retrieved value:  5

    You may observe that descriptors behave very much like the @property decorator. And it's no coincidence: @property is implemented using descriptors.
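    Descriptors are frequently used for validation. Here is a sketch of a 'non-negative' data descriptor (the class and attribute names are invented for illustration):

```python
class NonNegative:
    """Data descriptor that rejects negative values."""
    def __init__(self, initval=0):
        self.val = initval

    def __get__(self, obj, objtype=None):
        return self.val

    def __set__(self, obj, val):
        if val < 0:
            raise ValueError('value must be >= 0')
        self.val = val

class Account:
    balance = NonNegative(0)     # descriptor attached as a class attribute

acct = Account()
acct.balance = 100
print(acct.balance)              # 100

try:
    acct.balance = -5
except ValueError as e:
    print(e)                     # value must be >= 0
```

    Like RevealAccess above, this simple version stores the value on the descriptor itself, so it would be shared by all instances of Account; per-instance storage requires keeping the value on obj instead.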


    __slots__

    This class variable causes object attributes to be stored in a specially designated space rather than in a dictionary (as is customary).


    class MyClass(object):
        __slots__ = ['var', 'var2', 'var3']
    
    a = MyClass()
    
    a.var = 5
    a.var2 = 10
    a.var3 = 20
    a.var4 = 40   # AttributeError:  'MyClass' object has no attribute 'var4'

    All objects customarily store attributes in a designated dictionary under the attribute __dict__, and this dictionary takes up a fairly large amount of memory for each object created. __slots__, initialized as a list of attribute names in a class variable, causes Python not to create an object dictionary; instead, just enough memory for the named attributes is allocated. When many instances are being created, a marked improvement in memory use is possible: the hotel rating website Oyster.com reported reducing their memory consumption by a third by using __slots__.

    Please note however that __slots__ should not be used to limit the creation of attributes. This kind of control is considered "un-pythonic" in the sense that privacy and control are mostly cooperative schemes -- a user of your code should understand the interface and not attempt to subvert it by establishing unexpected attributes in an object.
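    We can verify the absence of the per-instance dictionary directly:

```python
import sys

class Slotted:
    __slots__ = ['var']

class Plain:
    pass

s = Slotted()
p = Plain()

print(hasattr(p, '__dict__'))     # True:  regular instances carry a dict
print(hasattr(s, '__dict__'))     # False: slotted instances do not

print(sys.getsizeof(p.__dict__))  # the per-instance dict has a real memory cost
```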




    SQL Part 3: CREATE TABLE and INSERT INTO

    Special Note: sqlite3 client column formatting

    Issue these two commands at the start of your sqlite> session


    At the start of your session, issue the following two commands -- these will format your sqlite3 output so it is clearer, and add columns headers.

    sqlite> .mode column
    sqlite> .headers on
    sqlite> SELECT * FROM revenue;
    
                  company     state       cost
                  ----------  ----------  ----------
                  Haddad's    PA          239.5
                  Westfield   NJ          53.9
                  The Store   NJ          211.5
                  Hipster's   NY          11.98
                  Dothraki F  NY          5.98
                  Awful's     PA          23.95
                  The Clothi  NY          115.2

    You may note that columns are 10 characters wide and that longer fields are cut off. You can set the width with a value for each column, for example .width 12 5 8 for the above table. Unfortunately this must be done separately for each table.


    Relational Table Structure and CREATE TABLE

    Table columns specify a data type.


    From the command line:

    $ sqlite3
    
    sqlite> .schema revenue
    CREATE TABLE revenue (company TEXT, state TEXT, cost FLOAT);

    .schema shows us the statement used to create this table. (In other databases, the DESC [tablename] statement shows table columns and types.) As you can see, each column is paired with a column type, which describes what kind of data can be stored in that column. To create a new table, we must specify a type for each column.

    type     data stored
    INTEGER  integer values
    REAL     floating-point values (FLOAT is accepted as an alias for REAL)
    TEXT     string values
    BLOB     binary or other data


    INSERT INTO statement and conn.commit()

    This statement adds a row (or rows) to a table. conn.commit() commits the transaction.


    # names fields to match values
    cursor.execute("INSERT INTO revenue (company, state, cost) VALUES ('IBM', 'NY', 3.50)")
    conn.commit()
    
    # values only -- assumes column order in table
    cursor.execute("INSERT INTO revenue VALUES ('IBM', 'NY', 3.50)")
    conn.commit()

    As with code, proper syntax is essential. Careful with quotes! The main challenge with SQL queries is the proper use of quotation marks: TEXT values must be quoted, while numeric values must not be. Lastly, the statement only takes effect when we call commit() on the connection object.
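    Here is a complete round trip, using an in-memory database so the sketch is self-contained:

```python
import sqlite3

conn = sqlite3.connect(':memory:')      # throwaway in-memory database
cursor = conn.cursor()

cursor.execute("CREATE TABLE revenue (company TEXT, state TEXT, cost FLOAT)")
cursor.execute("INSERT INTO revenue (company, state, cost) VALUES ('IBM', 'NY', 3.5)")
conn.commit()

cursor.execute("SELECT company, state, cost FROM revenue")
rows = cursor.fetchall()
print(rows)                              # [('IBM', 'NY', 3.5)]
conn.close()
```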


    sqlite3.OperationalError

    This exception is generated by SQLite3, usually when our query syntax is incorrect.


    When you receive this error, it means you have asked sqlite3 to do something it is unable to do. In these cases, you should print the query just before the execute() statement.


    query = "insert into revenue values ('Acme', 'CA')"
    c.execute(query)
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    sqlite3.OperationalError: table revenue has 3 columns but 2 values were supplied

    A common issue is the use of single quotes inside a quoted value:

    query = "insert into revenue values ('Awful's', 'NJ', 20.39)"
    c.execute(query)
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    sqlite3.OperationalError: near "A": syntax error

    Looking closely at the query above, you see that the name "Awful's" has a quote in it, so when SQL attempts to parse this string it becomes confused by the quote in the text. There are 3 possible solutions to this particular problem:

        1. escape the quote (use two single quotes instead of one inside the quoted string)
        2. use double quotes around the string in the query
        3. use parameterized arguments (preferred, see next)


    INSERT INTO with parameterized arguments

    We can 'inject' data into our INSERT query dynamically, similar to .format().


    co = 'IBM'
    state = 'NY'
    rev = 3.50
    
    # names fields to match values
    query = "INSERT INTO revenue (company, state, cost) VALUES (?, ?, ?)"
    cursor.execute(query, (co, state, rev))
    conn.commit()

    This accomplishes two things: we can develop a string template (similar to a .format() template) into which we can insert data that is dynamically accessed (during program run); more importantly, we can avoid any syntax issues that would occur for the use of quotes in the data being used.
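    The same placeholder syntax extends to inserting many rows at once with executemany(); an in-memory sketch:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute("CREATE TABLE revenue (company TEXT, state TEXT, cost FLOAT)")

rows = [("Awful's", 'NJ', 20.39),       # embedded quote: no escaping needed
        ('IBM', 'NY', 3.5)]

cursor.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)
conn.commit()

cursor.execute("SELECT COUNT(*) FROM revenue")
count = cursor.fetchone()[0]
print(count)                             # 2
```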


    DELETE FROM

    Delete some or all rows with this query (take care!)


    DELETE FROM removes rows from a table.

    DELETE FROM students WHERE student_id = 'jk43'

    Take special care -- DELETE FROM with no criteria will empty the table!

    DELETE FROM students

    WARNING -- the above statement clears out the table!
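    DELETE FROM also accepts parameterized arguments, and cursor.rowcount reports how many rows the last statement affected. An in-memory sketch (this students schema is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute("CREATE TABLE students (student_id TEXT, name TEXT)")
cursor.executemany("INSERT INTO students VALUES (?, ?)",
                   [('jk43', 'Joe'), ('mm12', 'Mary')])
conn.commit()

cursor.execute("DELETE FROM students WHERE student_id = ?", ('jk43',))
deleted = cursor.rowcount               # rows affected by the DELETE
conn.commit()
print(deleted)                          # 1

cursor.execute("SELECT COUNT(*) FROM students")
remaining = cursor.fetchone()[0]
print(remaining)                        # 1
```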




    Jinja2 Templating

    Jinja2: Getting Started

    Jinja2 is a module for inserting data into templates.


    Last week we used the simple .format() method to insert "dynamic" data (that is, variables that may be any value) into "static" templates (text that does not change). Jinja2 offers a full-featured templating system, including the following features:

        {{ variable }} substitution into text
        expression evaluation, including method calls
        {% for %} loops over collections
        {% if %} / {% else %} conditional blocks


    Here is a basic example showing these features:


    test.html, stored in the test_templates/ directory

    Hello, {{ name }}.
    
    Please say it loudly:  {{ compliment.upper() }}!
    
    Must I tell you:
    {% for item in pronouncements %}
       {{ item }}
    {% endfor %}
    
    Or {{ pronouncements[0] }}?
    
    {% if not reconciled %}
      We have work left to do.
    {% else %}
      I'm glad we worked that out.
    {% endif %}

    test.py, in the same dir as test_templates

    import jinja2
    
    env = jinja2.Environment()
    env.loader = jinja2.FileSystemLoader('test_templates')
    
    template = env.get_template('test.html')
    
    print(template.render(name='Joe', compliment="you're great",
                          pronouncements=['over', 'over', 'over again'],
                          reconciled=False))

    The rendered template!

    Hello, Joe.
    
    Please say it loudly:  YOU'RE GREAT!
    
    Must I tell you:
    
       over
       over
       over again
    
    Or over?
    
    
      We have work left to do.



    pandas part 1: Introduction

    pandas and numpy: Introduction

    pandas is a Python module used for manipulation and analysis of tabular data. It offers:

        Excel-like numeric calculations, particularly column-wise and row-wise calculations (vectorization)
        SQL-like merging, grouping and aggregating
        the ability to read from and write to CSV, Excel, JSON, database queries, etc.
        an emphasis on:
            aligning data from multiple sources
            "slicing and dicing" by rows and columns
            cleaning and normalizing missing or incorrect data
            aggregating and categorizing
            working with time series

    numpy is the data analysis library upon which pandas is built. We sometimes make direct calls to numpy -- to some of its variables (such as np.nan), variable-generating functions (such as np.arange or np.linspace) and some processing functions.


    Learning about pandas Functions and Object Methods

    We can list the object's attributes with dir() and see brief documentation on an attribute with help().


    pandas attributes

    import pandas as pd
    
    # list of pandas attributes (global vars)
    print(dir(pd))

    pandas functions (filtering for <class 'function'> only)

    import types
    
    for attr in dir(pd):
        if isinstance(getattr(pd, attr), types.FunctionType):
            print(attr)
    
    # short documentation on read_csv() function
    help(pd.read_csv)       # help on the read_csv function of pandas

    DataFrame attributes (methods and other)

    df = pd.DataFrame()
    print(dir(df))
    
    # list of DataFrame methods (filtering for <class 'method'> only)
    for attr in dir(df):
        if isinstance(getattr(df, attr), types.MethodType):
            print(attr)
    
    # short doc on DataFrame join() method
    help(df.join)           # help on the join() method of a DataFrame



    pandas Reference and Tutorials

    Use the docs for an ongoing study of pandas' rich feature set.


    pandas official documentation


    full docs (HTML, pdf)

    http://pandas.pydata.org/pandas-docs/stable
    http://pandas.pydata.org/pandas-docs/version/0.19.0/pandas.pdf

    "10 minutes to pandas"

    https://pandas.pydata.org/pandas-docs/stable/10min.html

    pandas cookbook

    http://pandas.pydata.org/pandas-docs/stable/cookbook.html



    matplotlib official documentation

    http://matplotlib.org/api/pyplot_api.html

    pandas textbook "Python for Data Analysis" by Wes McKinney


    http://www3.canisius.edu/~yany/python/Python4DataAnalysis.pdf

    (If the above link goes stale, simply search Python for Data Analysis pdf.)


    The new (Oct 2017) Second Edition is available in "raw/unedited" form from O'Reilly on Safari Bookshelf:

    http://shop.oreilly.com/product/0636920050896.do

    Please keep in mind that pandas is in active development, which means that features may be added, removed and changed (latest version: 0.20.3)

    blog tutorials

    These often provide the kind of "insider view" that is most helpful when getting oriented.


    Tom Augspurger blog (6-part series)

    http://tomaugspurger.github.io/modern-1.html

    Greg Reda blog (3-part series)

    http://gregreda.com/2013/10/26/intro-to-pandas-data-structures/

    cheat sheet (Treehouse)

    https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf





    pandas part 2: DataFrame, Series, Index; column dtypes

    pandas Object Types: DataFrame, Series, Index

    DataFrame: rows and columns; Series: a single column or single row; Index: column or row labels.


    A DataFrame:

        is the core pandas structure -- a 2-dimensional array or list of lists
        is like an Excel spreadsheet -- rows, columns, and row and column labels
        is also like a "dict of dicts" in that it holds column- and row-indexed Series
        offers "vectorized" operations (sum rows or columns, modify values across rows, etc.)
        offers database-like and Excel-like manipulations (merge, groupby, pivot table, etc.)


    import pandas as pd
    df = pd.DataFrame({'a': [1, 2, 3],
                       'b': [10, 20, 30],
                       'c': [100, 200, 300]},
                       index=['r1', 'r2', 'r3'])
    
    print(df)
                #      a   b    c
                # r1   1  10  100
                # r2   2  20  200
                # r3   3  30  300
    
    print(df['c']['r2'])    # 200

    A Series:

        is a "dictionary-like list" -- it orders values by associating them with an index
        has a dtype attribute that holds the common type of its objects


    # read a column as a Series (use DataFrame subscript)
    bcol = df['b']
    print(bcol)
                # r1    10
                # r2    20
                # r3    30
                # Name: b, dtype: int64
    
    
    # read a row as a Series (use subscript of df.loc[])
    oneidx = df.loc['r2']
    print(oneidx)
                # a      2
                # b     20
                # c    200
                # Name: r2, dtype: int64

    An Index is an object that provides indexing for both the Series (its item index) and the DataFrame (its column or row index).


    columns = df.columns   # Index(['a',  'b',  'c'],  dtype='object')
    idx = df.index         # Index(['r1', 'r2', 'r3'], dtype='object')
    

    The DataFrame: Initializing and Subscripting

    A dataframe can be indexed like a list and subscripted like a dictionary.


    import pandas as pd
    import numpy as np
    
    # initialize a new, empty DataFrame
    df = pd.DataFrame()
    
    # initialize a DataFrame with sample data
    df = pd.DataFrame( {'a': [1, 2, 3, 4],
                        'b': [1.0, 1.5, 2.0, 2.5],
                        'c': ['a', 'b', 'c', 'd'] }, index=['r1', 'r2', 'r3', 'r4'] )
    
    print(df)
    
    # a DataFrame, printed
    #     a    b  c
    # r1  1  1.0  a
    # r2  2  1.5  b
    # r3  3  2.0  c
    # r4  4  2.5  d
    
    

    DataFrame subscript: column Series


    s = df['a']
    
    print(s)        # r1    1
                    # r2    2
                    # r3    3
                    # r4    4
                    # Name: a, dtype: int64
    
    print(type(s))   # Series

    The Series: Initializing and Subscripting

    A Series is a pandas object representing a column or row in a DataFrame.


    Every DataFrame column or row is a Series:

    df = pd.DataFrame( {'a': [1, 2],
                        'b': [8, 9] },
                        index=['r1', 'r2'] )
    
    print(df)
                               #     a  b
                               # r1  1  8
                               # r2  2  9
    
    # DataFrame string subscript accesses a column
    print(df['a'])             # r1    1
                               # r2    2
                               # Name: a, dtype: int64
    
    print(type(df['a']))       # <class 'pandas.core.series.Series'>
    
    
    # DataFrame .loc[] indexer accesses the rows
    print(df.loc['r1'])        # a    1
                               # b    8
                               # Name: r1, dtype: int64
    
    print(type(df.loc['r1']))  # <class 'pandas.core.series.Series'>

    A Series can be also be initialized on its own:

    s1 = pd.Series([1, 2, 3, 4])
    s2 = pd.Series([5, 6, 7, 8])

    We can combine Series into DataFrames:

    df = pd.DataFrame([s1, s2])       # add Series as rows
    
    df = pd.concat([s1, s2], axis=1)  # add Series as columns
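    A quick sketch contrasting the two combinations by their resulting shapes:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([5, 6, 7, 8])

rows = pd.DataFrame([s1, s2])        # each Series becomes a row
print(rows.shape)                     # (2, 4)

cols = pd.concat([s1, s2], axis=1)   # each Series becomes a column
print(cols.shape)                     # (4, 2)
```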

    The Index: DataFrame Column or Index Labels

    An Index object is used to specify a DataFrame's columns or index, or a Series' index.


    Columns and Indices


    A DataFrame makes use of two Index objects: one to represent the columns, and one to represent the rows.

    df = pd.DataFrame( {'a': [1, 2, 3, 4],
                        'b': [1.0, 1.5, 2.0, 2.5],
                        'c': ['a', 'b', 'c', 'd'],
                        'd': [100, 200, 300, 400] },
                        index=['r1', 'r2', 'r3', 'r4'] )
    print(df)
        #     a    b  c    d
        # r1  1  1.0  a  100
        # r2  2  1.5  b  200
        # r3  3  2.0  c  300
        # r4  4  2.5  d  400

    .rename() method: columns or index labels can be reset using this DataFrame method.

    df = df.rename(columns={'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D'},
                   index={'r1': 'R1', 'r2': 'R2', 'r3': 'R3', 'r4': 'R4'})
    print(df)
        #     A    B  C    D
        # R1  1  1.0  a  100
        # R2  2  1.5  b  200
        # R3  3  2.0  c  300
        # R4  4  2.5  d  400

    .columns, .index: the columns or index can also be set directly using the DataFrame's attributes (this would have the same effect as above):

    df.columns = ['A', 'B', 'C', 'D']
    df.index = ['R1', 'R2', 'R3', 'R4']

    .set_index(): set any column to the index

    df2 = df.set_index('A')
    print(df2)
        #      B  C    D
        # A
        # 1  1.0  a  100
        # 2  1.5  b  200
        # 3  2.0  c  300
        # 4  2.5  d  400

    .reset_index(): we can reset the index to integers starting from 0; by default this converts the previous index into a new column:

    df3 = df.reset_index()
    print(df3)
        #   index  A    B  C    D
        # 0    R1  1  1.0  a  100
        # 1    R2  2  1.5  b  200
        # 2    R3  3  2.0  c  300
        # 3    R4  4  2.5  d  400

    or to drop the index when resetting, include drop=True

    df4 = df.reset_index(drop=True)
    print(df4)
        #    A    B  C    D
        # 0  1  1.0  a  100
        # 1  2  1.5  b  200
        # 2  3  2.0  c  300
        # 3  4  2.5  d  400

    .reindex(): we can change the order of the rows (and, with columns=, the columns):

    df5 = df.reindex(list(reversed(df.index)))
    
    df5 = df5.reindex(columns=list(reversed(df.columns)))
    
    print(df5)
    
                          #        D  C    B  A
                          #  R4  400  d  2.5  4
                          #  R3  300  c  2.0  3
                          #  R2  200  b  1.5  2
                          #  R1  100  a  1.0  1

    we can set names for index and column indices:

    df.index.name = 'year'
    df.columns.name = 'state'

    There are a number of more specialized Index object types:

    Index (the standard, default and most common Index type)
    RangeIndex (an index built from an integer range)
    Int64Index, UInt64Index, Float64Index (index values of specific types)
    DatetimeIndex, TimedeltaIndex, PeriodIndex, IntervalIndex (datetime-related indices)
    CategoricalIndex (an index related to the Categorical type)
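    A quick way to see which Index type pandas has chosen is to inspect the .index attribute directly. A small sketch (the data is made up for the demo):

    ```python
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3]})        # no index given
    print(type(df.index).__name__)             # RangeIndex (the default)

    # a DatetimeIndex is chosen when we set the index from datetime values
    df2 = df.set_index(pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']))
    print(type(df2.index).__name__)            # DatetimeIndex
    ```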


    DataFrame: Initializing from Data Source

    A DataFrame can be read from (and written to) CSV, JSON, Excel, SQL and other formats.


    Note: df is a common variable name for pandas DataFrame objects; you will see this name used frequently in these examples.


    CSV

    # read from file
    df = pd.read_csv('quarterly_revenue_2017Q4.csv')
    
    
    # write to file
    wfh = open('output.csv', 'w')
    df.to_csv(wfh, na_rep='NULL')
    
    
    # reading from Fama-French file (the abbreviated file, no header)
    # sep= indicates the delimiter on which to split() the fields
    # names= indicates the column heads
    df = pd.read_csv('FF_abbreviated.txt', sep=r'\s+',
                                           names=['date', 'MktRF', 'SMB', 'HML', 'RF'])
    
    
    # reading from Fama-French non-abbreviated (the main file including headers and footers)
    # skiprows=5:  start reading 5 rows down
    df = pd.read_csv('F-F_Research_Data_Factors_daily.txt', skiprows=5, sep=r'\s+',
                                                            names=['date', 'MktRF', 'SMB', 'HML', 'RF'])
    
    df.to_csv('newfile.csv')

    Excel

    # reading from excel file to DataFrame
    df = pd.read_excel('revenue.xlsx', sheet_name='Sheet1')
    
    # optional:  produce a 'reader' object used to obtain sheet names, etc.
    xls_file = pd.ExcelFile('data.xls')    # produce a file 'reader' object
    df = xls_file.parse('Sheet1')          # parse a selected sheet
    
    
    # write to excel
    df.to_excel('data2.xls', sheet_name='Sheet1')

    JSON

    # sample df for demo purposes
    df = pd.DataFrame( {'a': [1, 2, 3, 4],
                        'b': [1.0, 1.5, 2.0, 2.5],
                        'c': ['a', 'b', 'c', 'd'] }, index=['r1', 'r2', 'r3', 'r4'] )
    
    
    
    # write dataframe to JSON
    df.to_json('df.json')
    
    # read JSON back into a new DataFrame
    new_df = pd.read_json('df.json')

    Relational Database

    import sqlite3                        # file-based database format
    conn = sqlite3.connect('example.db')  # a db connection object
    
    df = pd.read_sql('SELECT this FROM that', conn)

    The above can be used with any database connection (MySQL, Oracle, etc.)
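    As a minimal round-trip sketch (the table and column names here are made up for the demo): a DataFrame can be written to a database table with .to_sql() and read back with pd.read_sql():

    ```python
    import sqlite3

    import pandas as pd

    conn = sqlite3.connect(':memory:')           # throwaway in-memory database

    df = pd.DataFrame({'sym': ['IBM', 'AAPL'], 'price': [120.5, 150.0]})
    df.to_sql('quotes', conn, index=False)       # write DataFrame to a new table

    df2 = pd.read_sql('SELECT sym, price FROM quotes', conn)
    print(df2)
    ```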


    From Clipboard: this option is excellent for cutting and pasting data from websites

    df = pd.read_clipboard(skiprows=5, sep=r'\s+',
                           names=['date', 'MktRF', 'SMB', 'HML', 'RF'])

    DataFrame and Series dtypes

    pandas deduces a column type based on values and applies it to the column automatically.


    Pandas is built on top of numpy, a numeric processing module, compiled in C for efficiency. Unlike core Python containers (but similar to a database table), numpy cares about object type. Wherever possible, numpy will assign a type to a column of values and attempt to maintain the type's integrity. This is done for the same reason it is done with database tables: speed and space efficiency. In the below DataFrame, numpy/pandas "sniffs out" the type of a column Series. It will set the type most appropriate to the values.


    import pandas as pd
    
    df = pd.DataFrame( {'a': [1, 2, 3, 4],
                        'b': [1.0, 1.5, 2.0, 2.5],
                        'c': ['a', 'b', 'c', 'd'],
                        'd': ['2016-11-01', '2016-12-01', '2017-01-01', '2018-02-01'] },
                        index=['r1', 'r2', 'r3', 'r4'] )
    
    print(df)
                      #     a    b  c          d
                      # r1  1  1.0  a 2016-11-01
                      # r2  2  1.5  b 2016-12-01
                      # r3  3  2.0  c 2017-01-01
                      # r4  4  2.5  d 2018-02-01
    
    print(df.dtypes)
    
                      # a      int64       # note special pandas types int64 and float64
                      # b    float64
                      # c     object       # 'object' is general-purpose type,
                      # d     object       #     covers strings or mixed-type columns
                      # dtype: object
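    Note that the date strings in column 'd' were sniffed as object, not as datetimes. If a true datetime64 column is wanted, it can be converted explicitly with pd.to_datetime() (a small sketch):

    ```python
    import pandas as pd

    df = pd.DataFrame({'d': ['2016-11-01', '2016-12-01', '2017-01-01', '2018-02-01']})
    print(df['d'].dtype)              # object -- strings, as sniffed

    df['d'] = pd.to_datetime(df['d'])     # parse strings into datetime64 values
    print(df['d'].dtype)              # datetime64[ns]
    ```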

    You can assign element values in an existing Series or DataFrame column. However, the new value should match the type defined for the Series; if not, pandas may refuse, or it may upconvert or cast the Series column to a more general type (usually object, because numpy cares mainly about numeric and datetime types).


    print(df.b.dtype)            # float64
    df.loc['r1', 'b'] = 'hello'  # assign a string into the float column
    print(df.b.dtype)            # object

    Note that we never told pandas to store these values as floats. But since they are all floats, pandas decided to set the type.


    We can change a dtype for a Series ourselves with .astype():

    df.a = df.a.astype('object')     # or df['a'] = df['a'].astype('object')
    
    df.loc['r1', 'a'] = 'hello'      # now allowed without changing the dtype

    The numpy dtypes you are most likely to see are:

    int64
    float64
    datetime64
    object

    Checking the memory usage of a DataFrame


    .info() provides approximate memory size of a DataFrame

    df.info()   # on the original example at the top
    
             #  <class 'pandas.core.frame.DataFrame'>
             #  Index: 4 entries, r1 to r4
             #  Data columns (total 4 columns):
             #  a    4 non-null int64
             #  b    4 non-null float64
             #  c    4 non-null object
             #  d    4 non-null object
             #  dtypes: float64(1), int64(1), object(2)
             #  memory usage: 160.0+ bytes

    '+' means "probably larger" -- by default, .info() sizes numeric columns exactly but counts only pointers for 'object' columns


    With memory_usage='deep', the size includes the contents of 'object' columns:

    df.info(memory_usage='deep')
    
             # memory usage: 832 bytes
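    For per-column numbers rather than a single total, .memory_usage() returns a Series of byte counts; deep=True again includes string contents. A sketch with a small made-up frame:

    ```python
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3, 4], 'c': ['a', 'b', 'c', 'd']})

    shallow = df.memory_usage().sum()          # object column counted as pointers only
    deep = df.memory_usage(deep=True).sum()    # includes the string contents

    print(shallow, deep)                       # deep is larger
    ```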

    Selecting a Series from a DataFrame

    Use a subscript (or attribute) to access columns by label; use the .loc[] or .iloc[] attributes to access rows by label or integer index.


    a DataFrame:

    df = pd.DataFrame( {'a': [1, 2, 3, 4],
                        'b': [1.0, 1.5, 2.0, 2.5],
                        'c': ['a', 'b', 'c', 'd'],
                        'd': [100, 200, 300, 400] },
                        index=['r1', 'r2', 'r3', 'r4'] )

    access column as Series:

    cola = df['a']       # Series with [1, 2, 3, 4] and index ['r1', 'r2', 'r3', 'r4']
    
    cola = df.a          # same -- can often use attribute labels for column name
    
    print(cola)
    
                # r1    1
                # r2    2
                # r3    3
                # r4    4
                # Name: a, dtype: int64

    access row as Series using index label 'r2':

    row2 = df.loc['r2']  # Series [2, 1.5, 'b', 200] and index ['a', 'b', 'c', 'd']

    access row as Series using integer index:

    row2 = df.iloc[1]    # Series [2, 1.5, 'b', 200] and index ['a', 'b', 'c', 'd'] (same as above)
    
    print(row2)
    
                # a      2
                # b    1.5
                # c      b
                # d    200
                # Name: r2, dtype: object
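    To pull a single cell rather than a whole row or column, a row label and column label can be combined in one .loc[] call; the .at[] accessor does the same for scalars. A sketch using the same DataFrame:

    ```python
    import pandas as pd

    df = pd.DataFrame( {'a': [1, 2, 3, 4],
                        'b': [1.0, 1.5, 2.0, 2.5],
                        'c': ['a', 'b', 'c', 'd'],
                        'd': [100, 200, 300, 400] },
                        index=['r1', 'r2', 'r3', 'r4'] )

    val = df.loc['r2', 'b']      # row label, then column label
    val2 = df.at['r2', 'b']      # optimized scalar accessor, same result
    print(val, val2)             # 1.5 1.5
    ```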

    (Note that the .ix DataFrame indexer was deprecated and has been removed from pandas; use .loc[] or .iloc[] instead.)


    Manipulating the Index

    An Index can be set with a column or other sequence.


    Sometimes a pd.read_excel() includes index labels in the first column. We can easily set the index with .set_index():


    print(df)
        #     0  a    b  c           d
        # 0  r1  1  1.0  a  2016-11-01
        # 1  r2  2  1.5  b  2016-12-01
        # 2  r3  3  2.0  c  2017-01-01
        # 3  r4  4  2.5  d  2018-02-01
    
    df = df.set_index(df[0])
    df = df[['a', 'b', 'c', 'd']]
    print(df)
        #     a    b  c           d
        # 0
        # r1  1  1.0  a  2016-11-01
        # r2  2  1.5  b  2016-12-01
        # r3  3  2.0  c  2017-01-01
        # r4  4  2.5  d  2018-02-01

    We can reset the index with .reset_index(), although this makes the index into a new column.


    df2 = df.reset_index()
    print(df2)
        #     0  a    b  c           d
        #
        # 0  r1  1  1.0  a  2016-11-01
        # 1  r2  2  1.5  b  2016-12-01
        # 2  r3  3  2.0  c  2017-01-01
        # 3  r4  4  2.5  d  2018-02-01

    As mentioned, we can move a column into the index with .set_index()

    df2 = df2.set_index(0)
    print(df2)
        #     a    b  c           d
        # 0
        # r1  1  1.0  a  2016-11-01
        # r2  2  1.5  b  2016-12-01
        # r3  3  2.0  c  2017-01-01
        # r4  4  2.5  d  2018-02-01

    To reset the index while dropping the original index, we can use drop=True:


    df3 = df2.reset_index(drop=True)
    print(df3)
        #    a    b  c           d
        #
        # 0  1  1.0  a  2016-11-01
        # 1  2  1.5  b  2016-12-01
        # 2  3  2.0  c  2017-01-01
        # 3  4  2.5  d  2018-02-01

    We can also sort the DataFrame by index using .sort_index():


    df4 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
                       index=['Cello', 'Alpha', 'Bow'])
    df5 = df4.sort_index()
    print(df5)
        #        a  b
        # Alpha  2  5
        # Bow    3  6
        # Cello  1  4

    The default is to sort by the row index; axis=1 allows us to sort by columns. ascending=False reverses the sort:


    df6 = df5.sort_index(axis=1, ascending=False)
    print(df6)
        #        b  a
        # Alpha  5  2
        # Bow    6  3
        # Cello  4  1

    Note that .sort_values() offers the same options for sorting by the values of a specified column or row.


    Most Operations Produce a New DataFrame

    Some DataFrame operations provide the inplace=True option


    Keep in mind that most operations return a new DataFrame copy, leaving the original unchanged. When working with a large dataset, inplace=True lets you avoid allocating memory for an additional copy.


    import pandas as pd
    
    df = pd.DataFrame({ 'a': [1, 2, 3, 4],
                        'b': [1.0, 1.5, 2.0, 2.5],
                        'c': ['a', 'b', 'c', 'd']   }, index=['r1', 'r2', 'r3', 'r4'])
    
    print(df)
        #     a    b  c
        # r1  1  1.0  a
        # r2  2  1.5  b
        # r3  3  2.0  c
        # r4  4  2.5  d
    
    df2 = df.set_index('a')
    
    print(df2)             # new dataframe
    
        #      b  c
        # a
        # 1  1.0  a
        # 2  1.5  b
        # 3  2.0  c
        # 4  2.5  d
    
    print(df)              # unchanged
        #     a    b  c
        # r1  1  1.0  a
        # r2  2  1.5  b
        # r3  3  2.0  c
        # r4  4  2.5  d
    
    df.set_index('a', inplace=True)
    
    print(df)
    
        #      b  c
        # a
        # 1  1.0  a
        # 2  1.5  b
        # 3  2.0  c
        # 4  2.5  d

    DataFrame and Series as list, set, etc.

    DataFrames behave as you might expect when converted to any Python container


    df = pd.DataFrame( {'a': [1, 2, 3, 4],
                        'b': [1.0, 1.5, 2.0, 2.5],
                        'c': ['a', 'b', 'b', 'a'] }, index=['r1', 'r2', 'r3', 'r4'] )
    
    
    print(len(df))             # 4
    
    print(len(df.columns))     # 3
    
    print(max(df['a']))        # 4
    
    print(list(df['a']))       # [1, 2, 3, 4]     (column for 'a')
    
    print(list(df.loc['r2']))  # [2, 1.5, 'b']   (row for 'r2')
    
    print(set(df['c']))        # {'b', 'a'}       (a set of unique values)

    DataFrame .values -- convert to a 2-dimensional numpy array


    A numpy array is a list-like object. A simple list comprehension can convert its rows to a list of lists:

    print(df.values)
    
        # array([[1, 1.0, 'a'],
        #        [2, 1.5, 'b'],
        #        [3, 2.0, 'b'],
        #        [4, 2.5, 'a']], dtype=object)
    
    
    lol = [ list(item) for item in df.values ]
    
    print(lol)
    
                              # [ [1, 1.0, 'a'],
                              #   [2, 1.5, 'b'],
                              #   [3, 2.0, 'b'],
                              #   [4, 2.5, 'a'] ]

    looping - loops through columns


    for colname in df:
        print('{}:  {}'.format(colname, type(df[colname])))
    
                              # a:  <class 'pandas.core.series.Series'>
                              # b:  <class 'pandas.core.series.Series'>
                              # c:  <class 'pandas.core.series.Series'>
    
    
    # looping with iterrows -- loops through rows
    for row_index, row_series in df.iterrows():
        print('{}:  {}'.format(row_index, type(row_series)))
    
                              # r1:  <class 'pandas.core.series.Series'>
                              # r2:  <class 'pandas.core.series.Series'>
                              # r3:  <class 'pandas.core.series.Series'>
                              # r4:  <class 'pandas.core.series.Series'>

    Keep in mind that we generally prefer vectorized operations across columns or rows to looping.
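    To illustrate the preference, here is the same row computation done with .itertuples() (which yields each row as a namedtuple) and as a single vectorized expression; the vectorized form is shorter and much faster on large data. A sketch:

    ```python
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3, 4],
                       'b': [1.0, 1.5, 2.0, 2.5]}, index=['r1', 'r2', 'r3', 'r4'])

    # looping: one Python-level iteration per row
    looped = [row.a + row.b for row in df.itertuples()]

    # vectorized: one broadcast operation across whole columns
    vectorized = (df['a'] + df['b']).tolist()

    print(looped == vectorized)    # True
    ```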




    pandas part 3: Subscripting, Slicing, Joining and Sorting

    slicing

    DataFrames can be sliced along a column or row (Series) or both (DataFrame)


    Access a Series object through DataFrame column or index labels. Again, we can apply any Series operation on any of the Series within a DataFrame -- slice, access by index, etc.


    df = pd.DataFrame( {'a': [1, 2, 3, 4],
                        'b': [1.0, 1.5, 2.0, 2.5],
                        'c': ['a', 'b', 'c', 'd'],
                        'd': [100, 200, 300, 400] },
                        index=['r1', 'r2', 'r3', 'r4'] )
    
    print(df['b'])
        # r1     1.0
        # r2     1.5
        # r3     2.0
        # r4     2.5
        # Name: b
    
        # print(df['b'][0:3])
        # r1     1.0
        # r2     1.5
        # r3     2.0
    
        # df['b']['r2']
        # 1.5

    Create a DataFrame from columns of another DataFrame. Oftentimes we want to eliminate one or more columns from our DataFrame. We do this by slicing Series out of the DataFrame to produce a new DataFrame:


    >>> df[['a', 'c']]
         a   c
    r1   1   a
    r2   2   b
    r3   3   c
    r4   4   d

    Far less often we may want to isolate a row from a DataFrame -- this is also returned to us as a Series. Note the column labels have become the Series index, and the row label becomes the Series Name.


    2-dimensional slicing: a double subscript can select a 2-dimensional slice (some rows and some columns).


    df[['a', 'b']]['r1': 'r3']

    Also note carefully the list inside the first subscript.


    Mask: Conditional Slicing

    Oftentimes we want to select rows based on row criteria (i.e., conditionally). To do this, we establish a mask, which is a test placed within subscript-like square brackets.


    Selecting rows based on column criteria:


    import pandas as pd
    
    df = pd.DataFrame( { 'a': [1, 2, 3, 4],
                         'b': [-1.0, -1.5, 2.0, 2.5],
                         'c': ['a', 'b', 'c', 'd']  }, index=['r1', 'r2', 'r3', 'r4'] )
    
    print(df)
    
                        #     a    b  c
                        # r1  1 -1.0  a
                        # r2  2 -1.5  b
                        # r3  3  2.0  c
                        # r4  4  2.5  d
    
    print(df[ df['b'] < 0 ])              # select rows where 'b' value is < 0
    
                        #     a    b  c
                        # r1  1 -1.0  a
                        # r2  2 -1.5  b

    The mask by itself returns a boolean Series. Its values indicate whether the test returned True for each value in the tested Series. A mask can of course be assigned to a name and used by name, which is common for complex criteria:

    mask = df['a'] > 2
    print(mask)          # we are printing this just for illustration
    
                        # r1    False
                        # r2    False
                        # r3    True
                        # r4    True
                        # Name: a, dtype: bool
    
    
    print(df[ mask ])
                        #     a    b  c
                        # r3  3  2.0  c
                        # r4  4  2.5  d

    negating a mask


    a tilde (~) in front of a mask creates its inverse:

    mask = df['a'] > 2
    print(df[ ~mask ])
    
                      #     a    b  c
                      # r1  1 -1.0  a
                      # r2  2 -1.5  b
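    Membership tests are another common mask: the Series .isin() method builds a boolean Series which can be negated with ~ just like any other mask. A sketch using the same DataFrame:

    ```python
    import pandas as pd

    df = pd.DataFrame( { 'a': [1, 2, 3, 4],
                         'b': [-1.0, -1.5, 2.0, 2.5],
                         'c': ['a', 'b', 'c', 'd']  }, index=['r1', 'r2', 'r3', 'r4'] )

    mask = df['c'].isin(['a', 'd'])     # True where the 'c' value is 'a' or 'd'
    print(df[~mask])                    # rows where 'c' is neither 'a' nor 'd'
    ```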

    compound tests in a mask use & for 'and', | for 'or', and ( ) to separate tests


    The parentheses are needed to disambiguate the parts of the compound test.

    print(df[ (df.a > 3) & (df.a < 5) ])
    
                      #     a    b  c
                      # r4  4  2.5  d

    Avoiding the 'copy of a slice' warning

    Use .loc[] rather than subscript slices for a guaranteed write to a slice


    We often begin work by reading a large dataset into a DataFrame, then slicing out a meaningful subset (eliminating columns and rows that are irrelevant to our analysis). Then we may wish to make some changes to the slice, or add columns to the slice. A recurrent problem in working with slices is that standard slicing may produce a link into the original data, or it may produce a temporary "copy". If a change is made to a temporary copy, our working data will not be changed.


    Here we are creating a slice by using a double subscript:

    dfi = pd.DataFrame({'c1': [0,   1,  2, 3,   4],
                       'c2': [5,   6,  7, 8,   9],
                       'c3': [10, 11, 12, 13, 14],
                       'c4': [15, 16, 17, 18, 19],
                       'c5': [20, 21, 22, 23, 24],
                       'c6': [25, 26, 27, 28, 29] },
               index = ['r1', 'r2', 'r3', 'r4', 'r5'])
    
    dfi_prime = dfi[ dfi['c1'] > 2 ]
    print(dfi_prime)
    
                      #     c1  c2  c3  c4  c5  c6
                      # r4   3   8  13  18  23  28
                      # r5   4   9  14  19  24  29
    
    dfi_prime['c3'] = dfi_prime['c1'] * dfi_prime['c2']
    
    
    A value is trying to be set on a copy of a slice from a DataFrame
    
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

    Note that in some cases this warning will not appear, and in others the warning will appear and yet the change will have taken effect.


    The same problem may occur with a simple slice selection:

    myslice = dfi[ ['c1', 'c2', 'c3'] ]
    print(myslice)
    
                      #     c1  c2  c3
                      # r1   0   5  10
                      # r2   1   6  11
                      # r3   2   7  12
                      # r4   3   8  13
                      # r5   4   9  14
    
    myslice['c3'] = myslice['c1'] * myslice['c2']
    print(myslice)

    The problem here is that pandas cannot guarantee whether the slice is a view on the original data, or a temporary copy! If it is a copy, a change to it will not affect the original data! What's particularly problematic about this warning is that we may not always see it in these situations. We may also see false positives and false negatives, as is acknowledged in the documentation. The takeaway is that whether or not you see a warning, and whether or not you see the change when you run it, you should never trust that you can change a value on a slice, even if it seems to work when testing.


    The solution is to use .loc or .iloc:

    filtered = dfi.loc[ dfi['c3'] > 11, : ]     # filter rows by a column criterion, include all columns

    Keep in mind that you may get a warning even with this approach; you can consider it a false positive (i.e., disregard it).


    A more crude, yet unambiguous solution: make a copy!

    filtered = dfi[ dfi['c3'] > 11 ].copy()

    Obviously this is less than optimal for large datasets.


    More details about .loc are in the next section.


    Using .loc[] to select data by column or row label

    If a slice is to be changed, it should be derived using .loc[] rather than plain subscript slicing.


    Again, starting with this DataFrame:

    dfi = pd.DataFrame({'c1': [0,   1,  2, 3,   4],
                       'c2': [5,   6,  7, 8,   9],
                       'c3': [10, 11, 12, 13, 14],
                       'c4': [15, 16, 17, 18, 19],
                       'c5': [20, 21, 22, 23, 24],
                       'c6': [25, 26, 27, 28, 29] },
               index = ['r1', 'r2', 'r3', 'r4', 'r5'])
    
      #     c1  c2  c3  c4  c5  c6
      # r1   0   5  10  15  20  25
      # r2   1   6  11  16  21  26
      # r3   2   7  12  17  22  27
      # r4   3   8  13  18  23  28
      # r5   4   9  14  19  24  29


    Slicing Columns: these examples select all rows and one or more columns.


    Slice a range of columns with a slice of column labels:

    dfi_slice = dfi.loc[:, 'c1': 'c3']
    
      #     c1  c2  c3
      # r1   0   5  10
      # r2   1   6  11
      # r3   2   7  12
      # r4   3   8  13
      # r5   4   9  14

    Note the slice upper bound is inclusive!


    Slice a single column Series with a string column label:

    dfi_slice = dfi.loc[:, 'c3']
    
      # r1    10
      # r2    11
      # r3    12
      # r4    13
      # r5    14
      # Name: c3, dtype: int64
    

    Slice a selection of columns with a list of column labels:

    dfi_slice = dfi.loc[:, ['c2', 'c3']]
    
      #     c2  c3
      # r1   5  10
      # r2   6  11
      # r3   7  12
      # r4   8  13
      # r5   9  14

    A tuple of labels such as ('c2', 'c3') is also accepted here.



    Slicing Rows: these examples select one or more rows and all columns.


    Slice a range of rows with a slice of row labels:

    dfi_slice = dfi.loc['r1': 'r3', :]
    
      #     c1  c2  c3  c4  c5  c6
      # r1   0   5  10  15  20  25
      # r2   1   6  11  16  21  26
      # r3   2   7  12  17  22  27

    Note the slice upper bound is inclusive!


    Slice a single row Series with a string row label:

    dfi_slice = dfi.loc['r2', :]
    
      # c1     1
      # c2     6
      # c3    11
      # c4    16
      # c5    21
      # c6    26
      # Name: r2, dtype: int64

    Slice a selection of rows with a list of row labels:

    dfi_slice = dfi.loc[['r1', 'r3', 'r5'], :]
    
      #     c1  c2  c3  c4  c5  c6
      # r1   0   5  10  15  20  25
      # r3   2   7  12  17  22  27
      # r5   4   9  14  19  24  29

    Use a list here; a tuple of row labels may be interpreted as a single MultiIndex key.



    Slicing Rows and Columns


    We can of course specify both rows and columns:

    dfi.loc['r1': 'r3', 'c1': 'c3']
    
      #     c1  c2  c3
      # r1   0   5  10
      # r2   1   6  11
      # r3   2   7  12

    Using .loc[] to select data by 'mask' criteria

    A conditional can be used with .loc[] to select rows or columns


    Again, starting with this DataFrame:

    dfi = pd.DataFrame({'c1': [0,   1,  2, 3,   4],
                       'c2': [5,   6,  7, 8,   9],
                       'c3': [10, 11, 12, 13, 14],
                       'c4': [15, 16, 17, 18, 19],
                       'c5': [20, 21, 22, 23, 24],
                       'c6': [25, 26, 27, 28, 29] },
               index = ['r1', 'r2', 'r3', 'r4', 'r5'])
    
      #     c1  c2  c3  c4  c5  c6
      # r1   0   5  10  15  20  25
      # r2   1   6  11  16  21  26
      # r3   2   7  12  17  22  27
      # r4   3   8  13  18  23  28
      # r5   4   9  14  19  24  29

    .loc[] can also specify rows or columns based on criteria -- here are all the rows with 'c3' value greater than 11 (and all columns):

    dfislice = dfi.loc[ dfi['c3'] > 11, :]
    
      #     c1  c2  c3  c4  c5  c6
      # r3   2   7  12  17  22  27
      # r4   3   8  13  18  23  28
      # r5   4   9  14  19  24  29

    In order to add or change column values based on a row mask, we can specify which column should change and assign a value to it:


    dfi.loc[ dfi['c3'] > 11, 'c6'] = dfi['c6'] * 100  # 100 * 'c6' value if 'c3' > 11
    print(dfi)
    
       #     c1  c2  c3  c4  c5    c6
       # r1   0   5  10  15  20    25
       # r2   1   6  11  16  21    26
       # r3   2   7  12  17  22  2700
       # r4   3   8  13  18  23  2800
       # r5   4   9  14  19  24  2900
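    A related idiom for deriving a whole new column from a condition is numpy's np.where(), which chooses between two values element-by-element (a sketch; the 'flag' column name is made up for the demo):

    ```python
    import numpy as np
    import pandas as pd

    dfi = pd.DataFrame({'c3': [10, 11, 12, 13, 14],
                        'c6': [25, 26, 27, 28, 29]},
                       index=['r1', 'r2', 'r3', 'r4', 'r5'])

    # 'big' where the condition holds, 'small' elsewhere
    dfi['flag'] = np.where(dfi['c3'] > 11, 'big', 'small')
    print(dfi)
    ```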

    DataFrame Concatenating / Appending

    pd.concat() joins DataFrames; the legacy df.append() method is a shortcut for a vertical concat


    concat() can join dataframes either horizontally or vertically.

    df = pd.DataFrame( {'a': [1, 2, ],
                        'b': [1.0, 1.5 ] } )
    
    df2 = pd.DataFrame( {'b': [1, 2 ],
                         'c': [1.0, 1.5 ] } )
    df3 = pd.concat([df, df2])
    print(df3)
    
          #      a    b    c
          # 0  1.0  1.0  NaN
          # 1  2.0  1.5  NaN
          # 0  NaN  1.0  1.0
          # 1  NaN  2.0  1.5

    Note that the column labels have been aligned. As a result, some data is seen to be "missing", with the NaN value used (discussed shortly).


    In horizontal concatenation, the row labels are aligned but the column labels may be repeated:

    df4 = pd.concat([df, df2], axis=1)
    print(df4)
    
          #    a    b  b    c
          # 0  1  1.0  1  1.0
          # 1  2  1.5  2  1.5

    DataFrame .append() is a method shortcut for vertical concatenation only (it has no axis= option), called on a DataFrame. Note that .append() was deprecated and has been removed in recent pandas versions, so prefer pd.concat():

    df = df.append(df2)            # compare:  pd.concat([df, df2])

    We can append a Series but must include the ignore_index=True parameter:

    df = pd.DataFrame( {'a': [1, 2, ],
                        'b': [1.0, 1.5 ] } )
    
    df = df.append(pd.Series(), ignore_index=True)
    
    print(df)
    
           #      a    b
           # 0  1.0  1.0
           # 1  2.0  1.5
           # 2  NaN  NaN
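    Since .append() has been removed from recent pandas versions, the same row-append can be written with pd.concat() and a one-row DataFrame (a sketch with made-up values):

    ```python
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2], 'b': [1.0, 1.5]})

    new_row = pd.DataFrame({'a': [3], 'b': [2.0]})
    df = pd.concat([df, new_row], ignore_index=True)   # renumber the index 0..n
    print(df)
    ```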

    DataFrame Sorting and Transposing

    Reorder and rotate a DataFrame


    import random
    rdf = pd.DataFrame({'a': [ random.randint(1,5) for i in range(5)],
                        'b': [ random.randint(1,5) for i in range(5)],
                        'c': [ random.randint(1,5) for i in range(5)]})
    print(rdf)
    
        #    a  b  c
        # 0  2  1  4
        # 1  5  3  3
        # 2  1  2  4
        # 3  5  2  4
        # 4  2  4  4
    
    # sorting by a column
    rdf = rdf.sort_values('a')
    print(rdf)
    
        #    a  b  c
        # 2  1  2  4
        # 0  2  1  4
        # 4  2  4  4
        # 1  5  3  3
        # 3  5  2  4
    
    # sorting by a row
    idf = rdf.sort_values(3, axis=1)
    print(idf)
    
        #    b  c  a
        # 2  2  4  1
        # 0  1  4  2
        # 4  4  4  2
        # 1  3  3  5
        # 3  2  4  5
    
    # sorting values by two columns (first by 'c', then by 'b')
    rdf = rdf.sort_values(['c', 'b'])
    print(rdf)
    
        #    a  b  c
        # 1  5  3  3
        # 0  2  1  4
        # 2  1  2  4
        # 3  5  2  4
        # 4  2  4  4
    
    # sorting by index
    rdf = rdf.sort_index()
    print(rdf)
    
        #    a  b  c
        # 0  2  1  4
        # 1  5  3  3
        # 2  1  2  4
        # 3  5  2  4
        # 4  2  4  4
    
    # sorting options:  ascending=False, axis=1
    

    Transposing


    Transposing simply means inverting the x and y axes -- in a sense, flipping the values diagonally:

    rdft = rdf.T
    print(rdft)
    
        #    0  1  2  3  4
        # a  2  5  1  5  2
        # b  1  3  2  2  4
        # c  4  3  4  4  4

    pandas .merge() (DataFrame .join())

    merge() provides database-like joins.


    Merge performs a relational database-like join on two dataframes. We can join on a particular field and the other fields will align accordingly.

    companies = pd.read_excel('company_states.xlsx', sheet_name='Companies')
    states = pd.read_excel('company_states.xlsx', sheet_name='States')
    
    print(companies)
         #      Company State
         # 0  Microsoft    WA
         # 1      Apple    CA
         # 2        IBM    NY
         # 3     PRTech    PR
    
    print(states)
    
         #   State Abbrev  State Long
         # 0           AZ     Arizona
         # 1           CA  California
         # 2           CO    Colorado
         # 3           NY    New York
         # 4           WA  Washington
    
    cs = pd.merge(companies, states,
                  left_on='State', right_on='State Abbrev')
    
    print(cs)
         #      Company State State Abbrev  State Long
         # 0  Microsoft    WA           WA  Washington
         # 1      Apple    CA           CA  California
         # 2        IBM    NY           NY    New York

    By default, merge() performs an 'inner' join on the column(s) common to both frames; we can also join on the index or on one or more named columns. The join types are similar to those in relational databases:


    Merge method  SQL Join Name     Description
    left          LEFT OUTER JOIN   Use keys from left frame only
    right         RIGHT OUTER JOIN  Use keys from right frame only
    outer         FULL OUTER JOIN   Use union of keys from both frames
    inner         INNER JOIN        Use intersection of keys from both frames

    how= describes the type of join. on= designates the column on which to join. If the join columns are differently named, we can use left_on= and right_on=.


    left join: include all keys from the 'left' dataframe. Note that the unmatched states (AZ, CO) are excluded, while PRTech is kept, with NaN for the missing state data.

    cs = pd.merge(companies, states, how='left',
                  left_on='State', right_on='State Abbrev')
    
    print(cs)
         #      Company State State Abbrev  State Long
         # 0  Microsoft    WA           WA  Washington
         # 1      Apple    CA           CA  California
         # 2        IBM    NY           NY    New York
         # 3     PRTech    PR          NaN         NaN

    (Right join would be the same but with the dfs switched.)


    outer join: include keys from both dataframes. Note that all states are included, and the missing data from 'companies' is shown as NaN

    cs = pd.merge(companies, states, how='outer',
                  left_on='State', right_on='State Abbrev')
    
    print(cs)
         #      Company State State Abbrev  State Long
         # 0  Microsoft    WA           WA  Washington
         # 1      Apple    CA           CA  California
         # 2        IBM    NY           NY    New York
         # 3     PRTech    PR          NaN         NaN
         # 4        NaN   NaN           AZ     Arizona
         # 5        NaN   NaN           CO    Colorado

    inner join: include only keys common to both dataframes. Note that PRTech and the unmatched states (AZ, CO) are all excluded:

    cs = pd.merge(companies, states, how='inner',
                  left_on='State', right_on='State Abbrev')
    
    print(cs)
         #      Company State State Abbrev  State Long
         # 0  Microsoft    WA           WA  Washington
         # 1      Apple    CA           CA  California
         # 2        IBM    NY           NY    New York
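    When auditing a join, merge()'s indicator=True option adds a _merge column marking where each row's keys came from ('both', 'left_only' or 'right_only'). A sketch with two tiny made-up frames:

    ```python
    import pandas as pd

    companies = pd.DataFrame({'Company': ['Apple', 'PRTech'],
                              'State': ['CA', 'PR']})
    states = pd.DataFrame({'State Abbrev': ['CA', 'AZ'],
                           'State Long': ['California', 'Arizona']})

    cs = pd.merge(companies, states, how='outer',
                  left_on='State', right_on='State Abbrev', indicator=True)

    print(cs[['Company', '_merge']])    # one row each: both, left_only, right_only
    ```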

    Working with Missing Data (NaN)

    "Not a Number" is the special numpy value used to mark missing data.


    If pandas can't insert a value (because indexes are misaligned or for other reasons), it inserts a special value noted as NaN (not a number) in its place. This value belongs to the numpy module, accessed through np.nan


    import numpy as np
    
    df = pd.DataFrame({ 'c1': [6, 6, np.nan],
                        'c2': [np.nan, 1, 3],
                        'c3': [2, 2, 2]  })
    
    print(df)
                      #     c1   c2  c3
                      # 0  6.0  NaN   2
                      # 1  6.0  1.0   2
                      # 2  NaN  3.0   2

    Note that we are specifying the NaN value with np.nan, although in most cases the value is generated by "holes" in mismatched data.


    We can fill missing data with fillna():

    df2 = df.fillna(0)
    print(df2)
                      #     c1   c2  c3
                      # 0  6.0  0.0   2
                      # 1  6.0  1.0   2
                      # 2  0.0  3.0   2

    Or we can choose to drop rows or columns that have any NaN values with dropna():

    df3 = df.dropna()
    
    print(df3)
                      #     c1   c2  c3
                      # 1  6.0  1.0   2
    
    # axis=1:  drop columns
    df4 = df.dropna(axis=1)
    
    print(df4)
    
                      #    c3
                      # 0   2
                      # 1   2
                      # 2   2

    Testing for NaN

    We may well be interested in whether a column or row has missing data. .isnull() provides a True/False mapping.


    print(df)
                      #     c1   c2  c3
                      # 0  6.0  NaN   2
                      # 1  6.0  1.0   2
                      # 2  NaN  3.0   2
    
    df['c1'].isnull().any()  # True
    df['c3'].isnull().any()  # False
    
    
    df['c1'].isnull().all()  # False
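    To count the missing cells rather than just test for them, .isnull() can be chained with .sum(), since each True counts as 1. A sketch using the same df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [6, 6, np.nan],
                   'c2': [np.nan, 1, 3],
                   'c3': [2, 2, 2]})

# .isnull() returns a same-shaped DataFrame of booleans;
# summing it counts the missing cells in each column
print(df.isnull().sum())
# c1    1
# c2    1
# c3    0
# dtype: int64
```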



    pandas part 4: Transformations and Computations

    Vectorized Operations

    Operations on columns are vectorized, meaning they are propagated (broadcast) across all values of a column Series in a DataFrame.


    import pandas as pd
    
    df = pd.DataFrame( {'a': [1, 2, 3, 4],
                        'b': [1.0, 1.5, 2.0, 2.5],
                        'c': ['a', 'b', 'c', 'd'] }, index=['r1', 'r2', 'r3', 'r4'] )
    
    print(df)
    
                      #     a    b  c
                      # r1  1  1.0  a
                      # r2  2  1.5  b
                      # r3  3  2.0  c
                      # r4  4  2.5  d
    
    
    # 'single value':  assign the same value to all cells in a column Series
    df['a'] = 0       # set all 'a' values to 0
    print(df)
    
                      #     a    b  c
                      # r1  0  1.0  a
                      # r2  0  1.5  b
                      # r3  0  2.0  c
                      # r4  0  2.5  d
    
    
    # 'calculation':  compute a new value for all cells in a column Series
    df['b'] = df['b'] * 2    # double all column 'b' values
    
    print(df)
    
                      #     a    b  c
                      # r1  0  2.0  a
                      # r2  0  3.0  b
                      # r3  0  4.0  c
                      # r4  0  5.0  d

    Adding New Columns with Vectorized Values

    We can also add a new column to the Dataframe based on values or computations:

    df = pd.DataFrame( {'a': [1, 2, 3, 4],
                        'b': [2.0, 3.0, 4.0, 5.0],
                        'c': ['a', 'b', 'c', 'd'] }, index=['r1', 'r2', 'r3', 'r4'] )
    
    
    df['d'] = 3.14           # new column, each field set to same value
    
    print(df)
    
                      #     a    b  c  d
                      # r1  1  2.0  a  3.14
                      # r2  2  3.0  b  3.14
                      # r3  3  4.0  c  3.14
                      # r4  4  5.0  d  3.14
    
    
    df['e'] = df['a'] + df['b']    # vectorized computation to new column
    
    print(df)
    
                      #     a    b  c     d  e
                      # r1  1  2.0  a  3.14  3.0
                      # r2  2  3.0  b  3.14  5.0
                      # r3  3  4.0  c  3.14  7.0
                      # r4  4  5.0  d  3.14  9.0

    Aggregate methods for DataFrame and Series

    Methods .sum(), .cumsum(), .count(), .min(), .max(), .mean(), .median(), et al. provide summary operations


    import numpy as np
    df = pd.DataFrame( {'a': [1, 2, 3, 4],
                        'b': [1.0, 1.5, np.nan, 2.5],
                        'c': ['a', 'b', 'b', 'a'] }, index=['r1', 'r2', 'r3', 'r4'] )
    
    print(df.sum())
    
         # a      10
         # b       5
         # c    abba
         # dtype: object
    
    
    print(df.cumsum())
    
         #      a    b     c
         # r1   1    1     a
         # r2   3  2.5    ab
         # r3   6  NaN   abb
         # r4  10    5  abba
    
    
    print(df.count())
    
         # a    4
         # b    3
         # c    4
         # dtype: int64

    Most of these methods work on a Series object as well:

    print(df['a'].median())
    
         # 2.5

    To see a list of attributes for any object, use dir() with a DataFrame or Series object. This is best done in jupyter notebook:

    dir(df['a'])     # attributes for Series

    The list of attributes is long, but this kind of exploration can provide some useful surprises.


    DataFrame groupby()

    A groupby operation performs the same type of operation as the database GROUP BY. Grouping rows of the table by the value in a particular column, you can perform aggregate sums, counts or custom aggregations.


    This simple hypothetical table shows client names, regions, revenue values and type of revenue.


    df = pd.DataFrame( { 'company': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta', 'Beta', 'Gamma', 'Gamma', 'Gamma'],
                         'region':  ['NE', 'NW', 'SW', 'NW', 'SW', 'NE', 'NE', 'SW', 'NW'],
                         'revenue': [10, 9, 2, 15, 8, 2, 16, 3, 9],
                         'revtype': ['retail', 'retail', 'wholesale', 'wholesale', 'wholesale',
                                     'retail', 'wholesale', 'retail', 'retail'] } )
    
    print(df)
    
                      #   company region  revenue    revtype
                      # 0   Alpha     NE       10     retail
                      # 1   Alpha     NW        9     retail
                      # 2   Alpha     SW        2  wholesale
                      # 3    Beta     NW       15  wholesale
                      # 4    Beta     SW        8  wholesale
                      # 5    Beta     NE        2     retail
                      # 6   Gamma     NE       16  wholesale
                      # 7   Gamma     SW        3     retail
                      # 8   Gamma     NW        9     retail

    groupby() built-in Aggregate Functions

    The "summary functions" like sum() count()


    Aggregations are provided by the DataFrame groupby() method, which returns a special groupby object. If we'd like to see revenue aggregated by region, we can simply select the column to aggregate and call an aggregation function on this object:


    # revenue sum by region
    rsbyr = df.groupby('region').sum()   # call sum() on the groupby object
    print(rsbyr)
    
                      #         revenue
                      # region
                      # NE           28
                      # NW           33
                      # SW           13
    
    
    # revenue average by region
    rabyr = df.groupby('region').mean()
    print(rabyr)
    
                      #           revenue
                      # region
                      # NE       9.333333
                      # NW      11.000000
                      # SW       4.333333

    The result is a dataframe with 'region' as the index and 'revenue' as the sole column. Note that although we didn't specify the revenue column, pandas noticed that the other columns were not numeric and excluded them from the sum and mean. If we ask for a count, pandas counts each column (the counts will be the same for each). So if we'd like the analysis to be limited to one or more columns, we can simply slice the dataframe first:


    # count of all columns by region
    print(df.groupby('region').count())
    
                      #         company  revenue  revtype
                      # region
                      # NE            3        3        3
                      # NW            3        3        3
                      # SW            3        3        3
    
    
    # count of companies by region
    dfcr = df[['company', 'region']]       # dataframe slice:  only 'company' and 'region'
    print(dfcr.groupby('region').count())
    
                      #         company
                      # region
                      # NE            3
                      # NW            3
                      # SW            3
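    Equivalently, the column to aggregate can be selected on the groupby object itself, which avoids the prior slice. A sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'region':  ['NE', 'NW', 'SW', 'NW', 'SW', 'NE', 'NE', 'SW', 'NW'],
                   'revenue': [10, 9, 2, 15, 8, 2, 16, 3, 9]})

# select the column on the groupby object itself -- no prior slice needed
print(df.groupby('region')['revenue'].sum())
# region
# NE    28
# NW    33
# SW    13
# Name: revenue, dtype: int64
```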

    List of selected built-in groupby functions


                   count()
                   mean()
                   sum()
                   min()
                   max()
               describe() (computes several summary statistics, including count, mean, std, min and max)
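    Several of these can be computed in one pass with .agg(), which accepts a list of function names. A sketch using the region/revenue data from above:

```python
import pandas as pd

df = pd.DataFrame({'region':  ['NE', 'NW', 'SW', 'NW', 'SW', 'NE', 'NE', 'SW', 'NW'],
                   'revenue': [10, 9, 2, 15, 8, 2, 16, 3, 9]})

# .agg() applies several aggregations at once, one result column per function
res = df.groupby('region')['revenue'].agg(['sum', 'mean', 'count'])
print(res)
#         sum       mean  count
# region
# NE       28   9.333333      3
# NW       33  11.000000      3
# SW       13   4.333333      3
```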



    pandas part 5: Advanced groupby(), apply(), TimeSeries, Binning and Categorizing

    Series.apply(): apply a function call across a vector

    The function is called with each value in a row or column.


    Sometimes our computation is more complex than simple math, or we need to apply a function to each element. We can use apply():

    import pandas as pd
    
    df = pd.DataFrame( {'a': [1, 2, 3, 4],
                        'b': [1.0, 1.5, 2.0, 2.5],
                        'c': ['a', 'b', 'c', 'd'] }, index=['r1', 'r2', 'r3', 'r4'] )
    
    print(df)
    
                      #     a    b  c
                      # r1  1  1.0  a
                      # r2  2  1.5  b
                      # r3  3  2.0  c
                      # r4  4  2.5  d
    
    
    df['d'] = df['c'].apply(str.upper)
    
    print(df)
                      #     a    b  c  d
                      # r1  1  1.0  a  A
                      # r2  2  1.5  b  B
                      # r3  3  2.0  c  C
                      # r4  4  2.5  d  D

    apply() with custom function or lambda

    We can also use a custom named function or a lambda with apply():


    print(df)
                      #     a    b  c  d
                      # r1  1  1.0  a  A
                      # r2  2  1.5  b  B
                      # r3  3  2.0  c  C
                      # r4  4  2.5  d  D
    
    
    df['e'] = df['a'].apply(lambda x: '$' + str(x * 1000) )
    
    print(df)
                      #     a    b  c  d      e
                      # r1  1  1.0  a  A  $1000
                      # r2  2  1.5  b  B  $2000
                      # r3  3  2.0  c  C  $3000
                      # r4  4  2.5  d  D  $4000

    See below for an explanation of lambda.


    Review: lambda expressions

    A lambda describes a function in shorthand.


    Compare these two functions, both of which add/concatenate their arguments:

    def addthese(x, y):
      return x + y
    
    addthese2 = lambda x, y:  x + y
    
    print(addthese(5, 9))        # 14
    print(addthese2(5, 9))       # 14

    The function definition and the lambda expression are equivalent - they both produce a function with the same behavior.
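    Lambdas are most often written inline at the point of use, for example as the key= argument to sorted():

```python
# sort pairs by their second element using an inline lambda
pairs = [('a', 3), ('b', 1), ('c', 2)]
print(sorted(pairs, key=lambda pair: pair[1]))
# [('b', 1), ('c', 2), ('a', 3)]
```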


    Advanced groupby(): multi-column aggregation

    Calculating a sum or count based on values in 2 or more columns.


    To aggregate by values in two combined columns, simply pass a list of columns by which to aggregate -- the result is called a "multi-column aggregation":


    print(df.groupby(['region', 'revtype']).sum())
    
                      #                   revenue
                      # region revtype
                      # NE     retail          12
                      #        wholesale       16
                      # NW     retail          18
                      #        wholesale       15
                      # SW     retail           3
                      #        wholesale       10

    Note that the index has 2 columns (you can tell because the tops of those columns are 'recessed' beneath the header row). This is a MultiIndex, or hierarchical index. In the above example NE stands over both retail and wholesale in the first 2 rows -- we should read these as NE-retail and NE-wholesale.
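    If the hierarchical index is inconvenient, .reset_index() flattens it back into ordinary columns. A sketch using just the grouping and value columns of the dataframe above:

```python
import pandas as pd

# just the grouping and value columns from the dataframe above
df = pd.DataFrame({'region':  ['NE', 'NW', 'SW', 'NW', 'SW', 'NE', 'NE', 'SW', 'NW'],
                   'revtype': ['retail', 'retail', 'wholesale', 'wholesale', 'wholesale',
                               'retail', 'wholesale', 'retail', 'retail'],
                   'revenue': [10, 9, 2, 15, 8, 2, 16, 3, 9]})

# same multi-column aggregation, then flatten the MultiIndex
flat = df.groupby(['region', 'revtype']).sum().reset_index()
print(flat)
#    region    revtype  revenue
# 0      NE     retail       12
# 1      NE  wholesale       16
# 2      NW     retail       18
# 3      NW  wholesale       15
# 4      SW     retail        3
# 5      SW  wholesale       10
```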


    grouping functions: use a custom summary function

    Like passing a function to sorted(), we can pass a function to df.groupby()


    df = pd.DataFrame( { 'company': [ 'Alpha', 'Alpha', 'Alpha',
                                      'Beta', 'Beta', 'Beta',
                                      'Gamma', 'Gamma', 'Gamma'],
                         'region':  [ 'NE', 'NW', 'SW', 'NW', 'SW',
                                      'NE', 'NE', 'SW', 'NW'],
                         'revenue': [ 10, 9, 2, 15, 8, 2, 16, 3, 9],
                         'revtype': [ 'retail', 'retail', 'wholesale',
                                      'wholesale', 'wholesale', 'retail',
                                      'wholesale', 'retail', 'retail'     ] } )
    
    print(df)
    
                      #   company region  revenue    revtype
                      # 0   Alpha     NE       10     retail
                      # 1   Alpha     NW        9     retail
                      # 2   Alpha     SW        2  wholesale
                      # 3    Beta     NW       15  wholesale
                      # 4    Beta     SW        8  wholesale
                      # 5    Beta     NE        2     retail
                      # 6   Gamma     NE       16  wholesale
                      # 7   Gamma     SW        3     retail
                      # 8   Gamma     NW        9     retail

    groupby() functions using apply()

    We can design our own custom aggregation functions and pass them to apply() on the groupby object (you might remember similarly passing a function to the key= argument of sorted()). Here is the equivalent of the sum() function, written as a custom function:


    def get_sum(df_slice):
        return sum(df_slice['revenue'])
    
    print(df.groupby('region').apply(get_sum))  # custom function: same as groupby('region').sum()
    
                      # region
                      # NE    28
                      # NW    33
                      # SW    13
                      # dtype: int64

    As was done with sorted(), pandas calls our groupby function multiple times, once for each group. The argument passed to our custom function is a dataframe slice containing just the rows from a single grouping -- in this case a specific region (i.e., it is called once with a slice of NE rows, once with NW rows, etc.). The function should return the desired value for that slice -- here, the sum of the revenue column (as mentioned, this simply illustrates a function that does the same work as the built-in .sum()). For a better view of what is happening, print df_slice inside the function -- you will see the values in each slice printed.

    Here is a custom function that returns the median ("middle value") for each region:


    def get_median(df):
        listvals = sorted(list(df['revenue']))
        lenvals = len(listvals)
        midval = listvals[ int(lenvals / 2) ]
        return midval
    
    print(df.groupby('region').apply(get_median))
    
                      # region
                      # NE    10
                      # NW     9
                      # SW     3
                      # dtype: int64

    grouping functions: use a function to identify a group

    Standard aggregations group rows based on a column value ('NW', 'SW', etc.) or a combination of column values. If more work is needed to identify a group, we can supply a custom function for this operation as well. Perhaps we'd like to group our rows by whether or not they achieved a certain revenue target: here, whether the value is 10 or greater. Since 1-digit values are below 10 and 2-digit values are 10 or above, our function can simply return the number of digits in the value. We reference the function in the call to groupby():


    def bydecplace(idx):
        row = df.loc[idx]                 # a Series with the row values for this index
        return len(str(row['revenue']))   # 2 if 10 or more; 1 if below 10
    
    print(df.groupby(bydecplace).sum())
                      #      revenue
                      #   1       33
                      #   2       41

    The value passed to the function is the index of a row, so we can use the .loc attribute with that index to access the row. This function isolates the revenue within the row and returns its string length.

    using lambdas as groupby() grouping functions

    Of course any of these simple functions can be rewritten as a lambda (and in many cases should be -- as here, since the function references the dataframe directly, and we should prefer not to refer to outside variables in a named function):


    def bydecplace(idx):
        row = df.loc[idx]                 # a Series with the row values for this index
        return len(str(row['revenue']))   # 2 if 10 or more; 1 if below 10
    
    print(df.groupby(lambda idx:  len(str(df.loc[idx]['revenue']))).sum())
    
                      #      revenue
                      #   1       33
                      #   2       41

    Review: the Index -- DataFrame Column or Index Labels

    An Index object is used to specify a DataFrame's columns or index, or a Series' index.


    Columns and Indices


    A DataFrame makes use of two Index objects: one to represent the columns, and one to represent the rows.

    df = pd.DataFrame( {'a': [1, 2, 3, 4],
                        'b': [1.0, 1.5, 2.0, 2.5],
                        'c': ['a', 'b', 'c', 'd'],
                        'd': [100, 200, 300, 400] },
                        index=['r1', 'r2', 'r3', 'r4'] )
    print(df)
        #     a    b  c    d
        # r1  1  1.0  a  100
        # r2  2  1.5  b  200
        # r3  3  2.0  c  300
        # r4  4  2.5  d  400

    .rename() method: columns or index labels can be reset using this DataFrame method.

    df = df.rename(columns={'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D'},
                   index={'r1': 'R1', 'r2': 'R2', 'r3': 'R3', 'r4': 'R4'})
    print(df)
        #     A    B  C    D
        # R1  1  1.0  a  100
        # R2  2  1.5  b  200
        # R3  3  2.0  c  300
        # R4  4  2.5  d  400

    .columns, .index: the columns or index can also be set directly using the DataFrame's attributes (this would have the same effect as above):

    df.columns = ['A', 'B', 'C', 'D']
    df.index = ['R1', 'R2', 'R3', 'R4']

    .set_index(): set any column to the index

    df2 = df.set_index('A')
    print(df2)
        #      B  C    D
        # A
        # 1  1.0  a  100
        # 2  1.5  b  200
        # 3  2.0  c  300
        # 4  2.5  d  400

    .reset_index(): we can reset the index to integers starting from 0; by default this converts the previous index into a new column:

    df3 = df.reset_index()
    print(df3)
        #   index  A    B  C    D
        # 0    R1  1  1.0  a  100
        # 1    R2  2  1.5  b  200
        # 2    R3  3  2.0  c  300
        # 3    R4  4  2.5  d  400

    or to drop the index when resetting, include drop=True

    df4 = df.reset_index(drop=True)
    print(df4)
        #    A    B  C    D
        # 0  1  1.0  a  100
        # 1  2  1.5  b  200
        # 2  3  2.0  c  300
        # 3  4  2.5  d  400

    .reindex(): we can change the order of the indices and thus the rows:

    df5 = df.reindex(list(reversed(df.index)))
    
    df5 = df5.reindex(columns=list(reversed(df.columns)))
    
    print(df5)
    
                          #       D  C    B  A
                          #  R4  400  d  2.5  4
                          #  R3  300  c  2.0  3
                          #  R2  200  b  1.5  2
                          #  R1  100  a  1.0  1

    we can set names for index and column indices:

    df.index.name = 'year'
    df.columns.name = 'state'

    There are a number of "exotic" Index object types:

                   Index (standard, default and most common Index type)
                   RangeIndex (index built from an integer range)
                   Int64Index, UInt64Index, Float64Index (index values of specific types)
                   DatetimeIndex, TimedeltaIndex, PeriodIndex, IntervalIndex (datetime-related indices)
                   CategoricalIndex (index related to the Categorical type)


    The MultiIndex: a sequence of tuples

    In a MultiIndex, we can think of a column or row label as having two items.


    A MultiIndex specifies an "index within an index" or "column within a column" for more sophisticated labeling of data.


    A DataFrame with multi-index columns this and that and multi-index index other and another

    this                  a                   b
    that                  1         2         1         2
    other another
    x     1       -1.618192  1.040778  0.191557 -0.698187
          2        0.924018  0.517304  0.518304 -0.441154
    y     1       -0.002322 -0.157659 -0.169507 -1.088491
          2        0.216550  1.428587  1.155101 -1.610666

    The MultiIndex can be generated by a multi-column aggregation, or it can be set directly, as below.


    The from_tuples() method creates a MultiIndex from tuple pairs that represent levels of the MultiIndex:

    arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
              ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
    
    tuples = list(zip(*arrays))    # zip two lists like a zipper
    
    #  [('bar', 'one'),
    #   ('bar', 'two'),
    #   ('baz', 'one'),
    #   ('baz', 'two'),
    #   ('foo', 'one'),
    #   ('foo', 'two'),
    #   ('qux', 'one'),
    #   ('qux', 'two')]
    
    
    index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
    
    #   MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
    #              labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
    #              names=['first', 'second'])

    The notation above is somewhat hard to read; the labels= parameter specifies, for each position in the index, which value from the corresponding levels= list appears there. (Newer pandas versions call this parameter codes=.)


    Here we're applying the above index to a Series object:

    s = pd.Series(np.random.randn(8), index=index)
    
    #   first  second
    #   bar    one       0.469112
    #          two      -0.282863
    #   baz    one      -1.509059
    #          two      -1.135632
    #   foo    one       1.212112
    #          two      -0.173215
    #   qux    one       0.119209
    #          two      -1.044236
    #   dtype: float64
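    When the index is a full Cartesian product like this one, the from_product() constructor builds it more directly than from_tuples():

```python
import pandas as pd

# builds the same 8-entry index as the from_tuples() example,
# from the Cartesian product of the two level lists
index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
    names=['first', 'second'])

print(len(index))        # 8
print(index[0])          # ('bar', 'one')
```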

    Slicing a MultiIndex

    Slicing works more or less as expected; tuples help us specify Multilevel indices.


    mindex = pd.MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
                           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
                           names=['first', 'second'])
    
    df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'], index=mindex)
    
    #                        A         B         C         D
    #   first second
    #   bar   one    -0.231171  0.340523  0.472207 -0.543819
    #         two     0.113923  0.367657  0.171424 -0.039921
    #   baz   one    -0.625282 -0.791371 -0.487958  0.568405
    #         two    -1.128698 -1.040629  2.536821 -0.844057
    #   foo   one    -1.319797 -1.277551 -0.614919  1.305367
    #         two     0.414166 -0.427726  0.929567 -0.524161
    #   qux   one     1.859414 -0.190417 -1.824712  0.454862
    #         two    -0.169519 -0.850846 -0.444302 -0.577360

    standard slicing

    df['A']
    #   first  second
    #   bar    one      -0.231171
    #          two       0.113923
    #   baz    one      -0.625282
    #          two      -1.128698
    #   foo    one      -1.319797
    #          two       0.414166
    #   qux    one       1.859414
    #          two      -0.169519
    #   Name: A, dtype: float64
    
    
    df.loc['bar']
    #                  A         B         C         D
    #   second
    #   one    -0.231171  0.340523  0.472207 -0.543819
    #   two     0.113923  0.367657  0.171424 -0.039921
    
    
    df.loc[('bar', 'one')]             # also:  df.loc['bar'].loc['one']
    #   A   -0.231171
    #   B    0.340523
    #   C    0.472207
    #   D   -0.543819
    #   Name: (bar, one), dtype: float64
    
    
    df.loc[('bar', 'two'), 'A']
    #   0.11392342023306047

    'cross-section' slicing with .xs

    The 'level' parameter allows slicing along a lower level


    mindex = pd.MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
                           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
                           names=['first', 'second'])
    
    df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'], index=mindex)
    
    #                        A         B         C         D
    #   first second
    #   bar   one    -0.231171  0.340523  0.472207 -0.543819
    #         two     0.113923  0.367657  0.171424 -0.039921
    #   baz   one    -0.625282 -0.791371 -0.487958  0.568405
    #         two    -1.128698 -1.040629  2.536821 -0.844057
    #   foo   one    -1.319797 -1.277551 -0.614919  1.305367
    #         two     0.414166 -0.427726  0.929567 -0.524161
    #   qux   one     1.859414 -0.190417 -1.824712  0.454862
    #         two    -0.169519 -0.850846 -0.444302 -0.577360
    
    
    # standard slicing
    df.xs('bar')
    #                  A         B         C         D
    #   second
    #   one    -0.231171  0.340523  0.472207 -0.543819
    #   two     0.113923  0.367657  0.171424 -0.039921
    
    df.xs(('baz', 'two'))
    #   A   -1.128698
    #   B   -1.040629
    #   C    2.536821
    #   D   -0.844057
    #   Name: (baz, two), dtype: float64
    
    
    # using the level= parameter
    df.xs('two', level='second')
    #                 A         B         C         D
    #   first
    #   bar    0.113923  0.367657  0.171424 -0.039921
    #   baz   -1.128698 -1.040629  2.536821 -0.844057
    #   foo    0.414166 -0.427726  0.929567 -0.524161
    #   qux   -0.169519 -0.850846 -0.444302 -0.577360

    TimeSeries: objects and methods

    These custom pandas objects provide powerful date calculation and generation.


                   Timestamp: a single timestamp representing a date/time
                   Timedelta: a date/time interval (like 1 month, 5 days or 2 hours)
                   Period: a particular date span (like 4/1/16 - 4/3/16 or 4Q17)
                   DatetimeIndex: DataFrame or Series Index of Timestamp
                   PeriodIndex: DataFrame or Series Index of Period


    Timestamp: a single point in time


    Timestamp() constructor: creating a Timestamp object from string, ints or datetime():

    tstmp = pd.Timestamp('2012-05-01')
    tstmp = pd.Timestamp(2012, 5, 1)
    tstmp = pd.Timestamp(datetime.datetime(2012, 5, 1))
    
    year  = tstmp.year    # 2012
    month = tstmp.month   # 5
    day   = tstmp.day     # 1

    .to_datetime(): convert a string, list of strings or Series to dates

    tseries = pd.to_datetime(['2005/11/23', '2010.12.31'])
       # DatetimeIndex(['2005-11-23', '2010-12-31'], dtype='datetime64[ns]', freq=None)
    
    tseries = pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None]))
    
    # using European dates
    tstmp = pd.to_datetime('11/12/2010', dayfirst=True)   # 2010-11-12
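    to_datetime() also takes errors= and format= parameters. A sketch:

```python
import pandas as pd

# unparseable strings raise an error by default; errors='coerce' turns them into NaT
tseries = pd.to_datetime(['2005/11/23', 'not a date'], errors='coerce')
print(tseries[1])        # NaT

# an explicit format= avoids guessing and speeds up large conversions
tstmp = pd.to_datetime('23-11-2005', format='%d-%m-%Y')
print(tstmp)             # 2005-11-23 00:00:00
```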


    Timedelta: a time interval


    Timedelta() constructor: creating an interval

    # strings
    td = pd.Timedelta('1 days')                 # Timedelta('1 days 00:00:00')
    td =  pd.Timedelta('1 days 00:00:00')       # Timedelta('1 days 00:00:00')
    td = pd.Timedelta('1 days 2 hours')         # Timedelta('1 days 02:00:00')
    td = pd.Timedelta('-1 days 2 min 3us')      # Timedelta('-2 days +23:57:59.999997')
    
    # negative Timedeltas
    td = pd.Timedelta('-1us')                   # Timedelta('-1 days +23:59:59.999999')
    
    
    # with args similar to datetime.timedelta
    # note: these MUST be specified as keyword arguments
    td = pd.Timedelta(days=1, seconds=1)        # Timedelta('1 days 00:00:01')
    
    
    # integers with a unit
    td = pd.Timedelta(1, unit='d')              # Timedelta('1 days 00:00:00')


    Period: a specific datetime->datetime interval


    Period constructor: creating a date-to-date timespan

    perimon = pd.Period('2011-01')               # default interval is 'month' (end time is 2011-01-31 23:59:59.999)
    periday = pd.Period('2012-05-01', freq='D')  # specify 'daily' (end datetime is 2012-05-01 23:59:59.999)
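    A Period supports integer arithmetic (shifting by its own frequency) and exposes its Timestamp boundaries:

```python
import pandas as pd

perimon = pd.Period('2011-01')          # monthly period

# adding an integer shifts by the period's own frequency (here, months)
print(perimon + 1)                      # 2011-02

# .start_time and .end_time give the Timestamp boundaries of the span
print(perimon.start_time)               # 2011-01-01 00:00:00
print(perimon.end_time)                 # 2011-01-31 23:59:59.999999999
```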

    Filtering / Selecting Dates

    Let's start with data as it might come from a CSV file. We've designed the date column to be the DataFrame's index:

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame( {'impressions': [9,    10,   8,    3,    7,    12    ],
                        'sales':       [2.03, 2.38, 1.93, 0.63, 1.85, 2.53  ],
                        'clients':     [4,    6,    5,    1,    5,    7     ]  },
                        index=[ '2016-11-15', '2016-12-01', '2016-12-15',
                                '2017-01-01', '2017-01-15', '2017-02-01' ] )
    
    print(df)
                      #             clients  impressions  sales
                      # 2016-11-15        4            9   2.03
                      # 2016-12-01        6           10   2.38
                      # 2016-12-15        5            8   1.93
                      # 2017-01-01        1            3   0.63
                      # 2017-01-15        5            7   1.85
                      # 2017-02-01        7           12   2.53
    print(type(df.index[0]))                           # <class 'str'>

    Note that the index values are strings. This would be standard in a read from a plaintext format like CSV (although not from a date-formatted column in Excel)


    We can convert the strings to Timestamp with astype():

    df.index = df.index.astype(np.datetime64)
    
    print(type(df.index))                      # <class 'pandas.tseries.index.DatetimeIndex'>
    print(type(df.index[0]))                   # <class 'pandas.tslib.Timestamp'>

    Now the index is a DatetimeIndex (no longer an Index), consisting of Timestamp objects, optimized for date calculation and selection.


    Filtering: with a Series or DatetimeIndex of Timestamp objects, rows can be selected or filtered quite easily:

    
    # all entries from 2016
    print(df['2016'])
    
                      #             clients  impressions  sales
                      # 2016-11-15        4            9   2.03
                      # 2016-12-01        6           10   2.38
                      # 2016-12-15        5            8   1.93
    
    
    # all entries from Dec. 2016
    print(df['2016-12'])
    
                      #             clients  impressions  sales
                      # 2016-12-01        6           10   2.38
                      # 2016-12-15        5            8   1.93
    
    
    # all entries from 12/10/16 onward
    print(df['2016-12-10':])
    
                      #             clients  impressions  sales
                      # 2016-12-15        5            8   1.93
                      # 2017-01-01        1            3   0.63
                      # 2017-01-15        5            7   1.85
                      # 2017-02-01        7           12   2.53
    
    
    # all entries from 12/10/16 - 1/10/17
    print(df['2016-12-10': '2017-01-10'])
    
                      #             clients  impressions  sales
                      # 2016-12-15        5            8   1.93
                      # 2017-01-01        1            3   0.63

    Creating, comparing and calculating dates with pd.Timedelta

    We add or subtract a Timedelta interval from a Timestamp


    Comparing Timestamps

    ts1 = pd.Timestamp('2011-07-09 11:30')
    ts2 = pd.Timestamp('2011-07-10 11:35')
    
    print(ts1 > ts2)            # False
    print(ts1 < ts2)            # True

    Computing Timedeltas

    td1 = ts2 - ts1
    print(td1)                  # 1 days 00:05:00
    print(type(td1))            # <class 'pandas.tslib.Timedelta'>
    
    # values in a Timedelta boil down to days and seconds
    print(td1.days)             # 1
    print(td1.seconds)          # 300
    
    ts3 = ts2 + td1             # adding 1 day and 5 minutes
    print(ts3)                  # Timestamp('2011-07-11 11:40:00')

    Creating Timedeltas

    # strings
    pd.Timedelta('1 days')                 # Timedelta('1 days 00:00:00')
    
    pd.Timedelta('1 days 00:00:00')        # Timedelta('1 days 00:00:00')
    
    pd.Timedelta('1 days 2 hours')         # Timedelta('1 days 02:00:00')
    
    pd.Timedelta('-1 days 2 min 3us')      # Timedelta('-2 days +23:57:59.999997')
    
    # like datetime.timedelta
    # note: these MUST be specified as keyword arguments
    pd.Timedelta(days=1, seconds=1)        # Timedelta('1 days 00:00:01')
    
    # integers with a unit
    pd.Timedelta(1, unit='d')              # Timedelta('1 days 00:00:00')
    
    # from a datetime.timedelta/np.timedelta64
    pd.Timedelta(datetime.timedelta(days=1, seconds=1))
                                           # Timedelta('1 days 00:00:01')
    
    pd.Timedelta(np.timedelta64(1, 'ms'))  # Timedelta('0 days 00:00:00.001000')
    
    # negative Timedeltas
    pd.Timedelta('-1us')                   # Timedelta('-1 days +23:59:59.999999')
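    A Timedelta can also be collapsed into a single number of seconds, or divided by another Timedelta to get a float ratio. A quick sketch (the values are invented for illustration):

```python
import pandas as pd

td = pd.Timedelta('1 days 2 hours')

# total_seconds() collapses days + seconds into one float
secs = td.total_seconds()             # 86400 + 7200 = 93600.0

# dividing two Timedeltas yields a plain float
ratio = td / pd.Timedelta('1 hour')   # 26.0
```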

    Generating a date range with pd.date_range()

    date_range() provides evenly spaced Timestamp objects.


    date_range() with a start date, periods= and freq=:

    # By default date_range() returns a DatetimeIndex.
    # 5 hours starting with midnight Jan 1st, 2011
    rng = pd.date_range('1/1/2011', periods=5, freq='H')
    print(rng)
            # DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 01:00:00',
            #                '2011-01-01 02:00:00', '2011-01-01 03:00:00',
            #                '2011-01-01 04:00:00'],
            #                dtype='datetime64[ns]', freq='H')
    
    
    ts = pd.Series(list(range(0, len(rng))), index=rng)
    
    print(ts)
        # 2011-01-01 00:00:00    0
        # 2011-01-01 01:00:00    1
        # 2011-01-01 02:00:00    2
        # 2011-01-01 03:00:00    3
        # 2011-01-01 04:00:00    4
        # Freq: H, dtype: int64

    date_range() with a start date and end date

    start = pd.Timestamp('1/1/2011')
    end =  pd.Timestamp('1/5/2011')
    tindex = pd.date_range(start, end)
    
    print(tindex)
      # DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03',
      #                '2011-01-04', '2011-01-05'])
      #               dtype='datetime64[ns]', length=5, freq='D')
    
      # note default frequency:  'D' (days)
    

    date_range() with a monthly period, dates are set to end of the month:

    tindex = pd.date_range(start='1/1/1980', end='11/1/1990', freq='M')

    date_range() with a monthly period, dates are set to start of the month:

    tindex = pd.date_range(start='1/1/1980', end='11/1/1990', freq='MS')
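    A quick way to see the difference between the two frequencies side by side (a short sketch; note that newer pandas versions spell month-end 'ME' rather than 'M'):

```python
import pandas as pd

# month-end dates: 1980-01-31, 1980-02-29, 1980-03-31
m = pd.date_range(start='1/1/1980', periods=3, freq='M')

# month-start dates: 1980-01-01, 1980-02-01, 1980-03-01
ms = pd.date_range(start='1/1/1980', periods=3, freq='MS')
```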

    date_range() with a start date, periods and freq

    tindex = pd.date_range('1/1/2011', periods=3, freq='W')
    
    print(tindex)
        # DatetimeIndex(['2011-01-02', '2011-01-09', '2011-01-16'],
        #               dtype='datetime64[ns]', freq='W-SUN')

    Note that freq= has defaulted to W-SUN, which indicates weekly dates falling on Sunday -- pandas even adjusted our first date on this basis! We can specify the anchor day of the week ourselves to start on a precise day.


    bdate_range() provides a date range that includes "business days" only:

    tbindex = pd.bdate_range(start, end)
    
    print(tbindex)
      # DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05'],
      #               dtype='datetime64[ns]', freq='B')
    
      # (the 1st and 2nd of Jan. 2011 are Saturday and Sunday)

    See the offset aliases portion of the documentation.
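    A few of those aliases in action (a sketch; the alias strings below are standard pandas offsets, not from the original examples):

```python
import pandas as pd

# every 15 minutes
r15 = pd.date_range('1/1/2011', periods=3, freq='15min')

# every 2 hours
r2h = pd.date_range('1/1/2011', periods=3, freq='2h')

# weekly, anchored on Wednesday (1/1/2011 is a Saturday,
# so the range begins on the following Wednesday, 1/5)
rw = pd.date_range('1/1/2011', periods=3, freq='W-WED')
```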


    Comparing dates within intervals with pd.Period

    The Period represents an interval with a start date/time


    The .end_time attribute value is calculated from the start date/time plus the freq= interval (minus one nanosecond).

    # a 'day' period
    per = pd.Period('2016-05-03')    # Period('2016-05-03', 'D')
    
    print(per.start_time)            # Timestamp('2016-05-03 00:00:00')
    print(per.end_time)              # Timestamp('2016-05-03 23:59:59.999999999')
    
    # a 'month' period
    pdfm = pd.Period('2016-05-03', freq='M')
    
    print(pdfm.start_time)           # Timestamp('2016-05-01 00:00:00')
    
    print(pdfm.end_time)             # Timestamp('2016-05-31 23:59:59.999999999')

    "frequency" (or freq=) is a bit of a misnomer. It describes the size of the period -- that is, the amount of time it covers. Thus a freq='M' (month) period spans the entire month that contains the given date.


    The Period object can be incremented to produce a new Period object. The freq interval determines the start date/time and size of the next Period.

    # a 'month' period
    pdfm = pd.Period('2016-05-03', freq='M')
    
    pdfm2 = pdfm + 1
    
    print(pdfm2.start_time)           # Timestamp('2016-06-01 00:00:00')
    print(pdfm2.end_time)             # Timestamp('2016-06-30 23:59:59.999999999')
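    Going the other direction, a Timestamp can be converted into the Period that contains it with .to_period() (a brief sketch):

```python
import pandas as pd

ts = pd.Timestamp('2016-05-03 11:30')

pmonth = ts.to_period('M')    # Period('2016-05', 'M')
pday = ts.to_period('D')      # Period('2016-05-03', 'D')
```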

    period_range(): produce a range of Period objects


    ps = pd.Series(list(range(12)), pd.period_range('1/2017', '12/2017', freq='M'))
    
    print(ps)
        # 2017-01     0
        # 2017-02     1
        # 2017-03     2
        # 2017-04     3
        # 2017-05     4
        # 2017-06     5
        # 2017-07     6
        # 2017-08     7
        # 2017-09     8
        # 2017-10     9
        # 2017-11    10
        # 2017-12    11
        # Freq: M, dtype: int64
    

    Above we have an index of Period objects; each period represents a monthly interval.


    This differs from a Timestamp index in that a comparison or selection (such as a slice) will include any value that falls within the requested period, even if the overlap is partial:

    print(ps['2017-03-15': '2017-06-15'])
    
        # 2017-03    2
        # 2017-04    3
        # 2017-05    4
        # 2017-06    5
        # Freq: M, dtype: int64

    Note that both 03 and 06 were included in the results, because the slice endpoints fall within those periods.


    Quarterly Period Range

    prng = pd.period_range('1990Q1', '2000Q4', freq='Q-JAN')
    
    sq = pd.Series(list(range(0, len(prng))), prng)
    print(sq)
    
        # 1990Q1    0
        # 1990Q2    1
        # 1990Q3    2
        # 1990Q4    3
        # 1991Q1    4
        # 1991Q2    5
        # 1991Q3    6
        # 1991Q4    7
        # Freq: Q-JAN, dtype: int64
    
    sq[pd.Timestamp('1990-02-13')]    # 4
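    The 'Q-JAN' anchor means quarters of a fiscal year ending in January, so 1991Q1 covers February through April of calendar 1990 -- which is why a February 1990 Timestamp lands on index 4 (1991Q1). A sketch inspecting the boundaries:

```python
import pandas as pd

q = pd.Period('1991Q1', freq='Q-JAN')

start = q.start_time    # Timestamp('1990-02-01 00:00:00')
end = q.end_time        # Timestamp('1990-04-30 23:59:59.999999999')
```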

    Binning

    Dividing values into bins based on a category scheme


    Bins allow us to categorize values (often dates) into "bins" which are mapped to a value to be applied. Consider the table below, which might come from an Excel spreadsheet:


    dfbin = pd.DataFrame({'start_date': [1, 6, 11, 16],
                          'end_date': [5, 10, 15, 20],
                          'percent': [1, 2, 3, 10]})
    
    # order the columns
    dfbin = dfbin[['start_date', 'end_date', 'percent']]
    
    print(dfbin)
             #    start_date  end_date  percent
             # 0           1         5        1
             # 1           6        10        2
             # 2          11        15        3
             # 3          16        20       10

    Any date from 1-5 should key to 1%; any from 6-10, 2%, etc.


    We have data that needs to be categorized into the above bins:

    data = pd.DataFrame({'period': list(range(1, 21))})
    
    print(data)
             #         period
             #     0        1
             #     1        2
             #     2        3
             #     3        4
             #     4        5
             #     5        6
             #     6        7
             #     7        8
             #     8        9
             #     9       10
             #     10      11
             #     11      12
             #     12      13
             #     13      14
             #     14      15
             #     15      16
             #     16      17
             #     17      18
             #     18      19
             #     19      20
    
    
    print(dfbin)
             #    start_date  end_date  percent
             # 0           1         5        1
             # 1           6        10        2
             # 2          11        15        3
             # 3          16        20       10
    
    # converting the 'start_date' field into a list
    bins = list(dfbin['start_date'])
    
    # adding the last 'end_date' value to the end
    bins.append(dfbin.loc[len(dfbin)-1, 'end_date']+1)
    
    # category labels (which can be strings, but here are integers)
    cats = list(range(1, len(bins)))
    
    print(bins)
    print(cats)
             # [1, 6, 11, 16, 21]
             # [1, 2, 3, 4]

    The cut function takes the data, bins and labels and sorts them by bin value:

    # 'right=False' keeps bins from overlapping (the bin does not include the rightmost edge)
    data['cat'] = pd.cut(data['period'], bins, labels=cats, right=False)
    print(data)
    
             #         period cat
             #     0        1   1
             #     1        2   1
             #     2        3   1
             #     3        4   1
             #     4        5   1
             #     5        6   2
             #     6        7   2
             #     7        8   2
             #     8        9   2
             #     9       10   2
             #     10      11   3
             #     11      12   3
             #     12      13   3
             #     13      14   3
             #     14      15   3
             #     15      16   4
             #     16      17   4
             #     17      18   4
             #     18      19   4
             #     19      20   4

    We are now free to use the bin mapping to apply the proper percent value to each row.
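    One way to finish the job is to map each 'cat' label back to its 'percent' value; the zip-built pct_map below is an illustration, not part of the original example:

```python
import pandas as pd

# rebuild the bin scheme from the example above
dfbin = pd.DataFrame({'start_date': [1, 6, 11, 16],
                      'end_date':   [5, 10, 15, 20],
                      'percent':    [1, 2, 3, 10]})

data = pd.DataFrame({'period': list(range(1, 21))})

bins = list(dfbin['start_date'])
bins.append(dfbin.loc[len(dfbin)-1, 'end_date'] + 1)
cats = list(range(1, len(bins)))

data['cat'] = pd.cut(data['period'], bins, labels=cats, right=False)

# map each category label back to its 'percent' value
# (category 1 -> row 0 of dfbin, category 2 -> row 1, etc.)
pct_map = dict(zip(cats, dfbin['percent']))
data['percent'] = data['cat'].astype(int).map(pct_map)
```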




    Matplotlib

    Matplotlib documentation

    (my slides are the clearest though)


    Matplotlib documentation can be found here: http://matplotlib.org/
    
    A very good rundown of features is in the Python for Data Analysis 2nd Edition PDF, Chapter 9.
    
    A clear tutorial on the central plotting function pyplot (part of which was used for this presentation) can be found here: https://matplotlib.org/users/pyplot_tutorial.html


    Plotting in a Python Script

    Use plt.plot() to plot; plt.savefig() to save as an image file.


    Python script using pyplot object:

    import matplotlib.pyplot as plt
    import numpy as np
    
    line_1_data = [1, 2, 3, 2, 4, 3, 5, 4, 6]
    line_2_data = [6, 4, 5, 3, 4, 2, 3, 2, 1]
    
    plt.plot(line_1_data)          # plot 1st line
    plt.plot(line_2_data)          # plot 2nd line
    
    plt.savefig('linechart.png')   # use any image extension
                                   # for an image of that type

    Plotting in a Jupyter Notebook

    Jupyter notebook session using pyplot object

    # load matplotlib visualization functionality
    %matplotlib notebook
    import matplotlib.pyplot as plt
    import numpy as np
    
    linedata = np.random.randn(1000).cumsum()
    
    plt.plot(linedata)

    Any calls to .plot() will display the figure in Jupyter.


    The Figure and Subplot Objects

    The figure represents the overall image; a figure may contain multiple subplots.


    Here we are establishing a figure with one subplot. The subplot object ax can be used to plot


    import numpy as np
    import matplotlib.pyplot as plt
    
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)            # this figure will have one subplot
                                             # 1 row, 1 column, position 1 within that
    ax.plot(np.random.randn(1000).cumsum())
    ax.plot(np.random.randn(1000).cumsum())

    Multiple Subplots Within a Figure

    We may create a column of 3 plots, a 2x2 grid of 4 plots, etc.


    import matplotlib.pyplot as plt
    import numpy as np
    
    fig = plt.figure()
    ax1 = fig.add_subplot(2, 2, 1)
    plt.plot(np.random.randn(50).cumsum(), 'k--')
    ax2 = fig.add_subplot(2, 2, 2)
    plt.plot(np.random.randn(50).cumsum(), 'b-')
    ax3 = fig.add_subplot(2, 2, 3)
    plt.plot(np.random.randn(50).cumsum(), 'r.')

    Establishing a grid of subplots with pyplot.subplots()


    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(2, 3)
    
    print(axes)
       # array([ [ <matplotlib.axes._subplots.AxesSubplot object at 0x7fb626374048>,
       #           <matplotlib.axes._subplots.AxesSubplot object at 0x7fb62625db00>,
       #           <matplotlib.axes._subplots.AxesSubplot object at 0x7fb6262f6c88> ],
       #         [ <matplotlib.axes._subplots.AxesSubplot object at 0x7fb6261a36a0>,
       #           <matplotlib.axes._subplots.AxesSubplot object at 0x7fb626181860>,
       #           <matplotlib.axes._subplots.AxesSubplot object at 0x7fb6260fd4e0> ] ],
       #           dtype=object)

    A fine discussion can be found at http://www.labri.fr/perso/nrougier/teaching/matplotlib/matplotlib.html#figures-subplots-axes-and-ticks


    Line Plotting along 1 or 2 axes

    plot() with a single list plots the values along the y axis, indexed by the list indices (0-3) along the x axis:

    fig = plt.figure()
    ax1 = fig.add_subplot(1, 1, 1)
    ax1.plot([1, 2, 3, 4])         # indexed against 0, 1, 2, 3 on x axis

    With two lists, the first list supplies the x axis values and the second list the y axis values:

    fig = plt.figure()
    ax1 = fig.add_subplot(1, 1, 1)
    ax1.plot([10, 20, 30, 40], [1, 4, 9, 16])

    Adding Seaborn to any Plot

    Simply importing seaborn makes plots look better.


    import seaborn as sns
    import matplotlib.pyplot as plt
    
    barvals = [10, 30, 20, 40, 30, 50]
    barpos = [0, 1, 2, 3, 4, 5]
    
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.bar(barpos, barvals)

    seaborn is an add-on library that can make any matplotlib plot more attractive, through the use of muted colors and additional styles. The library can be used for detailed control of style, but simply importing it provides a distinct improvement over the default primary colors.


    Line Color, Style, Markers

    import matplotlib.pyplot as plt
    
    line_1_data = [1, 2, 3, 2, 4, 3, 5, 4, 6]
    line_2_data = [6, 4, 5, 3, 4, 2, 3, 2, 1]
    
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(line_1_data, linestyle='dotted', color='red')
    ax.plot(line_2_data, linestyle='dashed', color='green', marker='o')


    Line Styles

    '-'   solid line style
    '--'  dashed line style
    '-.'  dash-dot line style
    ':'   dotted line style
    
    Markers
    
    '.'   point marker
    ','   pixel marker
    'o'   circle marker
    'v'   triangle_down marker
    '^'   triangle_up marker
    '<'   triangle_left marker
    '>'   triangle_right marker
    '1'   tri_down marker
    '2'   tri_up marker
    '3'   tri_left marker
    '4'   tri_right marker
    's'   square marker
    'p'   pentagon marker
    '*'   star marker
    'h'   hexagon1 marker
    'H'   hexagon2 marker
    '+'   plus marker
    'x'   x marker
    'D'   diamond marker
    'd'   thin_diamond marker
    '|'   vline marker
    '_'   hline marker
    
    Colors
    
    'b'   blue
    'g'   green
    'r'   red
    'c'   cyan
    'm'   magenta
    'y'   yellow
    'k'   black
    'w'   white


    Setting Axis Ticks and Tick Range

    import matplotlib.pyplot as plt
    
    ydata = [1, 2, 3, 2, 4, 3, 5, 4, 6]
    xdata = [0, 10, 20, 30, 40, 50, 60, 70, 80]
    
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    
    ax.set_yticks([2, 4, 6, 8])
    ax.set_xticks([0, 25, 50, 75, 100])
    
    ax.set_ylim(0, 10)
    ax.set_xlim(0, 100)
    
    ax.set_xticklabels(['zero', 'twenty-five', 'fifty', 'seventy-five', 'one hundred'],
                       rotation=30, fontsize='small')
    
    line1, = ax.plot(xdata, ydata)
    line2, = ax.plot([i+10 for i in xdata], ydata)
    
    ax.legend([line1, line2], ['this line', 'that line'])

    By default, matplotlib chooses the ticks on each axis based on the range of the data values passed to plot(). The calls below override those defaults.
    
    setting the tick range limits


    ax.set_ylim(0, 10)
    ax.set_xlim(0, 100)

    setting the ticks specifically


    ax.set_yticks([2, 4, 6, 8])
    ax.set_xticks([0, 25, 50, 75, 100])

    setting tick labels


    ax.set_xticklabels(['zero', 'twenty-five', 'fifty', 'seventy-five', 'one hundred'],
                        rotation=30, fontsize='small')

    plt.grid(True) to add a grid to the figure


    plt.grid(True)

    setting a legend


    line1, = ax.plot(xdata, ydata)
    line2, = ax.plot([i+10 for i in xdata], ydata)
    
    ax.legend([line1, line2], ['this line', 'that line'])

    Saving to File

    fig.savefig() saves the figure to a file.


    fig.savefig('myfile.png')

    The filename extension of a saved figure determines the filetype.

    print(fig.canvas.get_supported_filetypes())
    
        # {'eps': 'Encapsulated Postscript',
        #  'pdf': 'Portable Document Format',
        #  'pgf': 'PGF code for LaTeX',
        #  'png': 'Portable Network Graphics',
        #  'ps': 'Postscript',
        #  'raw': 'Raw RGBA bitmap',
        #  'rgba': 'Raw RGBA bitmap',
        #  'svg': 'Scalable Vector Graphics',
        #  'svgz': 'Scalable Vector Graphics'}

    Visualizing pandas Series and DataFrame

    pandas has fully incorporated matplotlib into its API.


    pandas Series objects have a plot() method that plots the Series values against the index

    import pandas as pd
    import numpy as np
    
    ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
    ts = ts.cumsum()
    ts.plot(kind="line")   # "line" is default

    pandas DataFrames also have a .plot() method that plots multiple lines


    import pandas as pd
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4]})
    df.plot()

    Pandas DataFrames also have a set of methods that create the type of chart desired.


    df.plot.area     df.plot.barh     df.plot.density  df.plot.hist     df.plot.line     df.plot.scatter
    df.plot.bar      df.plot.box      df.plot.hexbin   df.plot.kde      df.plot.pie
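    For instance, the bar variant of the earlier two-column DataFrame (a sketch; the Agg backend line is only needed when running outside a notebook, and the title is invented for illustration):

```python
import matplotlib
matplotlib.use('Agg')    # headless backend; omit in a notebook
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4]})

ax = df.plot.bar()       # equivalent to df.plot(kind='bar')
ax.set_title('a vs. b')
```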

    The pandas visualization documentation can be found here: http://pandas.pydata.org/pandas-docs/stable/visualization.html


    Bar Charts

    Bar charts display values as bars of proportional height.


    import matplotlib.pyplot as plt
    import numpy as np
    
    langs =    ('Python', 'C++', 'Java', 'Perl', 'Scala', 'Lisp')
    langperf = [     10,     8,      6,      4,       2,      1]
    
    y_pos = np.arange(len(langs))
    
    plt.bar(y_pos, langperf, align='center', alpha=0.5)
    plt.xticks(y_pos, langs)
    plt.ylabel('Usage')
    plt.title('Programming language usage')

    Pie Charts

    Pie charts set slice values as portions of a summed whole


    import numpy as np
    import matplotlib.pyplot as plt
    plt.pie([2, 3, 10, 20])

    Scatterplot

    Scatterplots set points at x, y coordinates, with varying sizes and colors


    import matplotlib.pyplot as plt
    import numpy as np
    
    N = 50
    x = np.random.rand(N)
    y = np.random.rand(N)
    colors = np.random.rand(N)
    area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii
    
    plt.scatter(x, y, s=area, c=colors, alpha=0.5)



    Flask

    Web Frameworks

    Introduction


    A "web framework" is an application or package that facilitates web programming. Server-side apps (for example: a catalog, content search and display, a reservation site or most other interactive websites) use a framework to handle the details of the web network request, page display and database input/output -- while freeing the programmer to supply just the logic of how the app will work.
    
    Full Stack Web Frameworks
    
    A web application consists of layers of components that are configured to work together (i.e., built into a "stack"). Such components may include:
    
    * authenticating and identifying users through cookies
    * handling data input from forms and URLs
    * reading and writing data to and from persistent storage (e.g., databases)
    * displaying templates with dynamic data inserted
    * providing styling to templates (e.g., with CSS)
    * providing dynamic web page functionality, as with AJAX
    
    The term "full stack developer" used by schools and recruiters refers to a developer who is proficient in all of these areas. Django is possibly the most popular web framework. This session would probably focus on Django, but its configuration and setup requirements are too lengthy for the available time.
    
    "Lightweight" Web Frameworks
    
    A "lightweight" framework provides the base functionality needed for a server-side application, but allows the user to add other stack components as desired. Such apps are typically easier to get started with because they require less configuration and setup. Flask is a popular lightweight framework whose many convenient defaults allow us to get our web application started quickly.


    The Flask app object, the app.run() method and @app.route dispatch decorators

    "@app.route()" functions describe what happens when the user visits a particular "page" or URL shown in the decorator.


    hello_flask.py


    Here is a basic template for a Flask app.

    #!/usr/bin/env python
    
    import flask
    app = flask.Flask(__name__)           # a Flask object
    
    @app.route('/hello')                 # called when visiting web URL 127.0.0.1:5000/hello
    def hello_world():
        print('*** DEBUG:  inside hello_world() ***')
        return '<PRE>Hello, World!</PRE>'            # expected to return a string (usu. the HTML to display)
    
    if __name__ == '__main__':
        app.run(debug=True, port=5000)    # app starts serving in debug mode on port 5000

    The first two lines and last two lines will always be present (production apps will omit the debug= and port= arguments).
    
    app is an object returned by Flask; we will use it for almost everything our app does. We call it a "god object" because it's always available and contains most of what we need.
    
    app.run() starts the Flask application server and causes Flask to wait for new web requests (i.e., when a browser visits the server).
    
    @app.route() functions are called when a particular URL is requested. The decorator specifies the string to be found at the end of the URL. For example, the above decorator @app.route('/hello') specifies that the URL to reach the function should be http://localhost:5000/hello
    
    The string returned from the function is passed on to Flask and then to the browser on the other end of the web request.


    Running a Flask app locally

    Flask comes complete with its own self-contained app server. We can simply run the app and it begins serving locally. No internet connection is required.


    $ python hello_flask.py
     * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
     * Restarting with stat
    127.0.0.1 - - [13/Nov/2016 15:58:16] "GET / HTTP/1.1" 200 -
    *** DEBUG:  inside hello_world() ***

    The Flask app prints out web server log messages showing the URL requested by each visitor. You can also print messages directly from the application, and they will appear in the log (the DEBUG message above was printed with *** strings for visibility). Changes to Flask code are detected and cause Flask to restart the server; errors cause it to exit.


    Whenever you make a change and save your script, the Flask server will restart -- you can see it issue messages to this effect:

     * Detected change in '/Users/dblaikie/Dropbox/tech/nyu/advanced_python/solutions/flask/guestbook/guestbook_simple.py', reloading
     * Restarting with stat
     * Debugger is active!
     * Debugger pin code: 161-369-356

    If there is an error in your code that prevents it from running, the code will raise an exception and exit.

     * Detected change in '/Users/dblaikie/Dropbox/tech/nyu/advanced_python/solutions/flask/guestbook/guestbook_simple.py', reloading
     * Restarting with stat
      File "./guestbook_simple.py", line 17
        return hello, guestbook_id ' + guestbook_id
                                                  ^
    SyntaxError: EOL while scanning string literal

    At that point, the browser will simply report "This site can't be reached" with no other information.


    Therefore we must keep an eye on the window to see if the latest change broke the script -- fix the error in the script, and then re-run the script in the Terminal.


    Redirection and Event Flow

    Once an 'event' function is called, it may return a string, call another function, or redirect to another page.


    Return a plain string


    A plain string will simply be displayed in the browser. This is the simplest way to display text in a browser.

    @app.route('/hello')
    def hello_world():
        return '<PRE>Hello, World!</PRE>'            # expected to return a string (usu. the HTML to display)


    Return HTML (also a string)


    HTML is simply tagged text, so it is also returned as a string.

    @app.route('/hello_template')
    def hello_html():
        return """
    <HTML>
      <HEAD>
        <TITLE>My Greeting Page</TITLE>
      </HEAD>
      <BODY>
        <H1>Hello, world!</H1>
      </BODY>
    </HTML>"""


    Return an HTML Template (to come)


    This method also returns a string, but it is returned from the render_template() function.

    @app.route('/hello_html')
    def hello_html():
        return flask.render_template('response.html')    # found in templates/response.html


    Return another function call


    Functions that aren't intended to return strings but to perform other actions (such as making database changes) can simply call other functions that represent the desired destination:

    def hello_html():
        return """
    <HTML>
      <HEAD>
        <TITLE>My Greeting Page</TITLE>
      </HEAD>
      <BODY>
        <H1>Hello, world (from another function)  !</H1>
      </BODY>
    </HTML>"""
    
    @app.route('/hello_template')
    def hello():
        return hello_html()

    Because hello() is calling hello_html() in its return statement, whatever is returned from there will be returned from hello().


    Redirecting to another program URL with flask.redirect() and flask.url_for()


    At the end of a function we can call the flask app again through a page redirect -- that is, to have the app call itself with new parameters.

    from datetime import date
    
    @app.route('/sunday_hello')
    def sunday_hello():
        return "It's Sunday!  Take a rest!"
    
    @app.route('/shello')
    def hello():
        if date.today().strftime('%a') == 'Sun':
            return flask.redirect(flask.url_for('sunday_hello'))
        else:
            return 'Hello, workday (or Saturday)!'

    redirect() issues a redirection to a specified URL; this can be http://www.google.com or any desired URL.


    url_for() simply produces the URL that routes to the named function -- here, /sunday_hello.
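    url_for() can also accept keyword arguments, which become query-string parameters if they don't match a route variable. Outside of a live request we need a test context to build URLs (a sketch re-declaring the route above; the ref parameter is invented for illustration):

```python
import flask

app = flask.Flask(__name__)

@app.route('/sunday_hello')
def sunday_hello():
    return "It's Sunday!  Take a rest!"

# url_for() needs an active request/app context
with app.test_request_context():
    url = flask.url_for('sunday_hello')           # '/sunday_hello'
    url2 = flask.url_for('sunday_hello', ref=1)   # '/sunday_hello?ref=1'
```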


    Building a Multi-Page Application

    Use flask.url_for() to build links to other pages within the app


    This app has three pages; we can build a "closed system" of pages by having each page link to another within the site.


    happy_image = 'http://davidbpython.com/advanced_python/python_data/happup.jpg'
    sad_image = 'http://davidbpython.com/advanced_python/python_data/sadpup.jpg'
    
    @app.route('/question')
    def ask_question():
        return """
    <HTML>
      <HEAD><TITLE>Do you like puppies?</TITLE></HEAD>
      <BODY>
          <H3>Do you like puppies?</H3>
          <A HREF="{}">arf!</A><BR>
          <A HREF="{}">I prefer cats...</A>
      </BODY>
    </HTML>""".format(flask.url_for('yes'), flask.url_for('no'))
    
    
    @app.route('/yes')
    def yes():
        return """
    <HTML>
      <HEAD><TITLE>C'mere Boy!</TITLE></HEAD>
      <BODY>
          <H3>C'mere, Boy!</H3>
          <IMG SRC="{}"><BR>
          <BR>
          Change your mind?  <A HREF="{}">Let's try again.</A>
      </BODY>
    </HTML>""".format(happy_image, flask.url_for('ask_question'))
    
    
    @app.route('/no')
    def no():
        return """
    <HTML>
      <HEAD><TITLE>Aww...</TITLE></HEAD>
      <BODY>
          <H3>Aww...really?</H3>
          <IMG SRC="{}"><BR>
          <BR>
          Change your mind?  <A HREF="{}">Let's try again.</A>
      </BODY>
    </HTML>""".format(sad_image, flask.url_for('ask_question'))

    Simple Templating

    Use {{ varname }} to create template tokens and flask.render_template() to insert values into them.


    HTML pages are rarely written into Flask apps; instead, we use standalone template files. The template files are located in a templates directory placed in the same directory as your Flask script.


    question.html

    <HTML>
      <HEAD><TITLE>Do you like puppies?</TITLE></HEAD>
      <BODY>
          <H3>Do you like puppies?</H3>
          <A HREF="{{ yes_link }}">arf!</A><BR>
          <A HREF="{{ no_link }}">I prefer cats...</A>
      </BODY>
    </HTML>

    puppy.html

    <HTML>
      <HEAD><TITLE>{{ title_message }}</TITLE></HEAD>
      <BODY>
          <H3>{{ title_message }}</H3>
          <IMG SRC="{{ puppy_image }}"><BR>
          <BR>
          Change your mind?  <A HREF="{{ question_link }}">Let's try again.</A>
      </BODY>
    </HTML>

    puppy_question.py

    happy_image = 'http://davidbpython.com/advanced_python/python_data/happup.jpg'
    sad_image =   'http://davidbpython.com/advanced_python/python_data/sadpup.jpg'
    
    
    @app.route('/question')
    def ask_question():
        return flask.render_template('question.html',
                                      yes_link=flask.url_for('yes'),
                                      no_link=flask.url_for('no'))
    
    @app.route('/yes')
    def yes():
        return flask.render_template('puppy.html',
                                      puppy_image=happy_image,
                                      question_link=flask.url_for('ask_question'),
                                      title_message='Cmere, boy!')
    
    @app.route('/no')
    def no():
        return flask.render_template('puppy.html',
                                      puppy_image=sad_image,
                                      question_link=flask.url_for('ask_question'),
                                      title_message='Aww... really?')

    Embedding Python Code in Templates

    Use {% %} to embed Python code for looping, conditionals and some functions/methods from within the template.


    Template document: "template_test.html"


    <!DOCTYPE html>
    <html lang="en">
      <head>
        <title>Important Stuff</title>
      </head>
      <body>
    
      <h1>Important Stuff</h1>
    
      Today's magic number is {{ number }}<br><br>
    
      Today's strident word is {{ word.upper() }}<br><br>
    
      Today's important topics are:<br>
      {% for item in mylist %}
          {{ item }}<br>
      {% endfor %}
      <br><br>
    
      {% if reliability_warning %}
          WARNING:  this information is not reliable
      {% endif %}
    
      </body>
    </html>

    Flask code


    @app.route('/template_test')
    def template_test():
        return flask.render_template('template_test.html', number=1035,
                                                           word='somnolent',
                                                           mylist=['children', 'animals', 'bacteria'],
                                                           reliability_warning=True)

    As before, {{ variable }} can be used for variable insertions, as well as instance attributes and method/function calls.
    {% %} blocks can be used for 'if' tests, looping with 'for' and other basic control flow.


    Reading args from URL or Form Input

    Input from a page can come from a link URL, or from a form submission.


    name_question.html

    <HTML>
      <HEAD>
      </HEAD>
      <BODY>
        What is your name?<BR>
        <FORM ACTION="{{ url_for('greet_name') }}" METHOD="post">
          <INPUT NAME="name" SIZE="20">
          <A HREF="{{ url_for('greet_name') }}?no_name=1">I don't have a name</A>
          <INPUT TYPE="submit" VALUE="tell me!">
        </FORM>
      </BODY>
    </HTML>

    flaskapp.py

    @app.route('/name_question')
    def ask_name():
        return flask.render_template('name_question.html')
    
    @app.route('/greet', methods=['POST', 'GET'])
    def greet_name():
        name =    flask.request.form.get('name')     # from a POST (form with 'method="POST"')
        no_name = flask.request.args.get('no_name')  # from a GET (URL)
    
        if name:
            msg = 'Hello, {}!'.format(name)
        elif no_name:
            msg = 'You are anonymous.  I respect that.'
        else:
            raise ValueError('\nraised error:  no "name" or "no_name" params passed in request')
    
        return '<PRE>{}</PRE>'.format(msg)
    
    

    Base templates

    Many times we want to apply the same HTML formatting to a group of templates -- for example the <head> tag, which may include css formatting, javascript, etc.


    We can do this with base templates:

    {% extends "base.html" %}         # 'base.html' contains the surrounding HTML
    {% block body %}
        <h1>Special Stuff</h1>
        Here is some special stuff from the world of news.
    {% endblock %}

    The base template "surrounds" any template that extends it, inserting the extending template's block content at the {% block body %} tag:

    <html>
      <head>
      </head>
      <body>
        <div class="container">
          {% block body %}
             <H1>This is the base template default body.</H1>
          {% endblock %}
        </div>
      </body>
    </html>

    There are many other features of Jinja2 as well as ways to control the API, although I have found the above features to be adequate for my purposes.


    Sessions

    Sessions (usually supported by cookies) allow Flask to identify a user between requests (which are by nature "anonymous").


    When a session is set, a cookie with a specific ID is passed from the server to the browser, which then returns the cookie on the next visit to the server. In this way the browser is constantly re-identifying itself through the ID on the cookie. This is how most websites keep track of a user's visits.
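    Flask signs the session cookie with app.secret_key so that a client cannot forge or alter it. Below is a minimal sketch of that signing idea using only the standard library's hmac module -- an illustration of keyed signing, not Flask's actual cookie format (the cookie value and helper are made up):

```python
import hashlib
import hmac

secret_key = b'A0Zr98j/3yX R~XHH!jmN]LWX/,?RT'   # analogous to app.secret_key

def sign(value: bytes) -> str:
    # keyed digest: only a holder of the secret key can reproduce it
    return hmac.new(secret_key, value, hashlib.sha256).hexdigest()

cookie_value = b'user_id=42'
signature = sign(cookie_value)

# on a later request, the server re-signs the returned value and compares
print(hmac.compare_digest(signature, sign(b'user_id=42')))   # True: untampered
print(hmac.compare_digest(signature, sign(b'user_id=43')))   # False: altered
```

    If the signatures differ, the server discards the session rather than trusting the altered value.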


    flask_session.py

    import flask
    app = flask.Flask(__name__)
    
    app.secret_key = 'A0Zr98j/3yX R~XHH!jmN]LWX/,?RT'  # secret key
    
    @app.route('/index')
    def hello_world():
    
        # see if the 'login' link was clicked:  set a session ID
        user_id = flask.request.args.get('login')
        if user_id:
            flask.session['user_id'] = user_id
            is_session = True
    
        # else see if the 'logout' link was clicked:  clear the session
        elif flask.request.args.get('logout'):
            flask.session.clear()
    
        # else see if there is already a session cookie being passed:  retrieve the ID
        else:
            # see if a session cookie is already active between requests
            user_id = flask.session.get('user_id')
    
        # tell the template whether we're logged in (user_id is a numeric ID, or None)
        return flask.render_template('session_test.html', is_session=user_id)
    
    
    if __name__ == '__main__':
        app.run(debug=True, port=5001)    # app starts serving in debug mode on port 5001

    session_test.html

    <!DOCTYPE html>
    <html lang="en">
      <head>
        <title>Session Test</title>
      </head>
      <body>
    
      <h1>Session Test</h1>
    
      {% if is_session %}
      <font color="green">Logged In</font>
      {% else %}
      <font color="red">Logged Out</font>
      {% endif %}
    
      <br><br>
    
      <a href="index?login=True">Log In</a><br>
      <a href="index?logout=True">Log Out</a><br>
    
      </body>
    </html>

    Config Values

    Configuration values control how Flask works; they can also be set and referenced by an individual application.


    Flask sets a number of variables for its own behavior, among them DEBUG=True to display errors to the browser, and SECRET_KEY='!jmNZ3yX R~XWX/r]LA098j/,?RTHH' to set a session cookie's secret key. A list of Flask default configuration values can be found in the Flask documentation.

    Retrieving config values


    value = app.config['SERVER_NAME']

    Setting config values individually


    app.config['DEBUG'] = True

    Setting config values from a file


    app.config.from_pyfile('flaskapp.cfg')

    Such a file need only contain Python code that sets uppercased constants -- these will be added to the config.

    Setting config values from a configuration object

    Similarly, the class variables defined within a custom class can be read and applied to the config with app.config.from_object(). Note in the example below that we can use inheritance to distribute configs among several classes, which can aid in organization and/or selection:


    In a file called configmodule.py:

    class Config(object):
        DEBUG = False
        TESTING = False
        DATABASE_URI = 'sqlite://:memory:'
    
    class ProductionConfig(Config):
        DATABASE_URI = 'mysql://user@localhost/foo'
    
    class DevelopmentConfig(Config):
        DEBUG = True
    
    class TestingConfig(Config):
        TESTING = True

    In the flask script:

    app.config.from_object('configmodule.ProductionConfig')
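    Under the hood, from_object() simply copies a class's uppercase attributes into the config mapping. A plain-Python sketch of that behavior (the from_object helper here is illustrative, not Flask's code):

```python
class Config:
    DEBUG = False
    TESTING = False

class DevelopmentConfig(Config):
    DEBUG = True                      # overrides the inherited value

def from_object(obj):
    # mimic app.config.from_object(): keep only UPPERCASE attributes
    return {key: getattr(obj, key) for key in dir(obj) if key.isupper()}

config = from_object(DevelopmentConfig)
print(config)                         # {'DEBUG': True, 'TESTING': False}
```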

    Environment Variables

    Environment variables are values set in the operating system's environment; they are visible to any program run in that environment, and can also be set by individual applications.
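    In Python, environment variables are read and set through the os.environ mapping (MYAPP_DATA_DIR is a made-up name for illustration):

```python
import os

# setting: visible to this process and to any child processes it launches
os.environ['MYAPP_DATA_DIR'] = '/tmp/myapp'

# reading: .get() returns None (or a supplied default) if the name is unset
data_dir = os.environ.get('MYAPP_DATA_DIR')
print(data_dir)                                  # /tmp/myapp
print(os.environ.get('NO_SUCH_VAR', 'not set'))  # not set
```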


    The OpenShift web container sets a number of environment variables, among them OPENSHIFT_LOG_DIR for log files and OPENSHIFT_DATA_DIR for data files. A list of OpenShift environment variables can be found here.


    Flask and security

    An important caveat regarding web security: Flask is not considered to be a secure approach to handling sensitive data.


    ...at least, that was the opinion of a former student, a web programmer who worked for Bank of America, about a year ago -- his team evaluated Flask, decided that it was not reliable and could have security vulnerabilities, and chose CGI -- the baseline protocol for handling web requests -- instead. Any framework is likely to have vulnerabilities; only careful research and/or the advice of a professional can ensure reliable privacy. For most applications, however, this is not a concern -- simply avoid storing sensitive data on a server unless security has been considered.


    Flask-Specific Errors

    Keep these Flask-specific errors in mind.


    Page not found


    A URL specified in @app.route() should not end in a trailing slash, and the URL entered into the browser must match it exactly:

    @app.route('/hello')

    If you add a trailing slash to this URL in the browser, the server will respond that it can't find the page.


    What's worse, some browsers sometimes try to 'correct' your URL entries, so if you type a URL with a trailing slash and get "Page not found", the next time you type it differently (even if correctly) the browser may attempt to "correct" it to the way you typed it the first time (i.e., incorrectly). This can be extremely frustrating; the only remedy I have found is to clear the browser's browsing data.


    Functions must return strings (or redirect to another function or URL); they do not print page responses.

    @app.route('/hello')
    def hello():
        return 'Hello, world!'        # not print 'Hello, world!'

    Each routing function expects a string to be returned -- so the function must do one of these:


       1) return a string (this string will be displayed in the browser)
       2) call another @app.route() function that will return a string
       3) issue a URL redirect (described later)

    Method not allowed

    This error usually means that a form was submitted specifying method="POST" but the @app.route() decorator doesn't specify methods=['POST']. See "Reading args from URL or Form Input", above.


    name_question.html

    <HTML>
      <HEAD>
      </HEAD>
      <BODY>
        What is your name?<BR>
        <FORM ACTION="{{ url_for('greet_name') }}" METHOD="post">
          <INPUT NAME="name" SIZE="20">
          <A HREF="{{ url_for('greet_name') }}?no_name=1">I don't have a name</A>
          <INPUT TYPE="submit" VALUE="tell me!">
        </FORM>
      </BODY>
    </HTML>

    If the form above submits data as a "post", the app.route() function would need to specify this as well:

    @app.route('/greet', methods=['POST', 'GET'])
    def greet_name():
       ...



    Server and Web Technologies

    Important Server and Web Technologies

    Basic proficiency in web technologies is a must for any IT Professional.


    Essential skills:


    The Internet Host

    An internet host is a computer connected on a network that is accessible from any online computer.



    Establishing an Account on a Host

    Many different kinds of hosting accounts and services are on offer.



    Editing Files Directly on a Server

    pico is an easy terminal-based editor; vi and emacs are traditional IT favorites


    [ to come ]


    Copying Files to and from a Server

    scp is the command-line utility; other hosts provide web-based copying


    [ to come ]


    The Web Server

    The web server is a program that serves out web pages and runs web programs.


    Web Servers on Hosting Plans

    Most web hosting plans (often costing $5 - $15 per month) provide an htdocs folder for serving out pages and a cgi-bin folder for executing web programs. The simplest and most traditional way to put your programs on the web is to sign up for a hosting plan and upload your scripts to the cgi-bin. However, free services such as PythonAnywhere (below) can be simpler still.


    PythonAnywhere for Free, Easy and Scalable Application Hosting

    They're not only easy, they also seem to be extremely earnest.


    [ to come ]




    Web Clients and Scraping

    Basic HTTP

    Headers, Cookie Headers and Response Codes


    HTTP (HyperText Transfer Protocol) is the protocol for sending and receiving messages between a browser (the client) and a web server (the server). We call the browser's message the request and the server's message the response.

    request: a request consists of two parts: the URL (along with any parameters) and (optionally) any content. A request generally uses one of two methods: GET (for retrieving data) or POST (for sending data, like form input). Note that in this context method is not related to a Python method. HTTP headers are meta information sent along with a request; this may include session (cookie) information.

    response: the content of a response is the HTML, text or other data (it can be binary data, or anything else a browser can send or receive). The response also includes headers: these report the response code and the size of the response, and may also contain cookies. The response code indicates whether the request was handled without error (200), whether there was a server error (500), etc.


    requests: requests.get() and requests.post()

    The requests module allows your Python script to act as a web client (i.e., like a browser)


    A web client is any program that can send HTTP requests to a web server. Your browser is a web client. The requests module lets us issue HTTP requests in the same way a browser would. We can therefore have our Python program serve as a web client and behave like a browser, i.e. to visit web pages and download data.
    Requesting a web page through requests.get() or requests.post()


    A page is requested as an HTTP GET or POST request:

    import requests
    
    response = requests.get('http://www.nytimes.com')      # or .post()
    
    page_text =   response.text                # entire downloaded file
    status_code = response.status_code         # HTTP code indicating status (success (200), not found (404), forbidden, etc.)
    
    page_text = page_text.encode('utf-8')      # usu. not necessary
    
    print('status code:  {}'.format(status_code))
    print('======================= page text =======================')
    print(page_text)

    From the client's perspective, GET and POST are often interchangeable. POST may be required by some websites for certain requests. POST is usually used to submit information from a form (as contrasted with requesting a page to view) although in most cases neither type is limited to a particular purpose.


    requests: joining a relative URL to an absolute one

    Many links on a page are relative to the page. The URL must be completed to be used.


    from requests.compat import urljoin, quote_plus
    
    url = 'http://some-address.com/'
    
    relative_link = 'api/myfile.html'             # the sort of link found on a webpage
    
    completed_link = urljoin(url, relative_link)
    
    print(completed_link)   # http://some-address.com/api/myfile.html
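    urljoin is actually the standard library's urllib.parse.urljoin (requests.compat simply re-exports it). Its result depends on whether the relative link begins with a slash:

```python
from urllib.parse import urljoin   # the same function requests.compat exposes

base = 'http://some-address.com/api/page.html'

# no leading slash: resolved relative to the page's directory
print(urljoin(base, 'other.html'))    # http://some-address.com/api/other.html

# leading slash: resolved from the site root
print(urljoin(base, '/other.html'))   # http://some-address.com/other.html
```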

    requests: requesting a web page with parameters

    Many web requests include parameters specifying what content or action is desired.


    To pass parameters, we simply include a dict passed to get() as params=.

    pdict = {'assignment_id': '1.1', 'student_id': 'bill_hanson'}
    
    response = requests.get('https://young-tundra-64507.herokuapp.com/route_view',
                            params=pdict)

    In a GET request, the parameters appear in the URL:

    https://young-tundra-64507.herokuapp.com/route_view?assignment_id=1.1&student_id=bill_hanson

    The parameters here are assignment_id (value 1.1) and student_id (value bill_hanson)
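    Under the hood, requests builds that query string from the dict -- the same transformation performed by the standard library's urllib.parse.urlencode:

```python
from urllib.parse import urlencode

pdict = {'assignment_id': '1.1', 'student_id': 'bill_hanson'}

query = urlencode(pdict)   # percent-encode each value and join pairs with '&'
print(query)               # assignment_id=1.1&student_id=bill_hanson

url = 'https://young-tundra-64507.herokuapp.com/route_view?' + query
print(url)
```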


    requests: posting data to a website, with parameter input

    A form submission is usually sent as a POST request and includes parameter data. Here is a sample form:

    <FORM ACTION="http://www.mywebsite.com/user" METHOD="POST">
      <INPUT NAME="firstname"><BR>
      <INPUT NAME="lastname"><BR>
      <INPUT NAME="password" TYPE="password">
      <INPUT TYPE="submit">
    </FORM>

    This form produces key/value data in the body of the request.


    We can replicate this kind of request by using requests.post() and passing a dict of param keys and values as data=, which sends them in the request body:

    userdata = {"firstname": "John", "lastname": "Doe", "password": "jdoe123"}
    
    resp = requests.post('http://www.mywebsite.com/user', data=userdata)

    This post() call uses the same syntax as the get() call, except that data= places the parameters in the request body rather than in the URL (as params= would).


    requests: decoding a JSON response

    API calls are also made over HTTP, and if the call returns JSON, this can easily be decoded through requests.

    import requests
    
    response = requests.get('http://api.wunderground.com/api/d2e101aa48faa661/conditions/q/CA/San_Francisco.json')
    conditions_json = response.json()
    
    print(conditions_json["current_observation"]["temp_f"])        # 84.5 (accessing data within the JSON)
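    response.json() is a convenience wrapper around ordinary JSON decoding; the same decoding can be done by hand with the standard json module (the payload below is a made-up fragment of the kind this API might return):

```python
import json

payload = '{"current_observation": {"temp_f": 84.5}}'   # hypothetical response body

conditions = json.loads(payload)    # decode JSON text into Python dicts/lists
print(type(conditions))             # <class 'dict'>
print(conditions["current_observation"]["temp_f"])      # 84.5
```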

    requests: cookies

    Cookies are used to maintain "session" information across HTTP requests.


    A session represents multiple visits by the same user to a website. The website is able to identify a user through the use of cookies, which are small bits of identifying information returned from a website upon a first visit; the browser is then expected to pass these same cookies to the website upon each subsequent visit. Because the identifying information is unique to the user, the website can "track" the user's visits to the site through the cookie. Cookies are often used to carry login information. This is why a user can log into a website on a first request, and the website will re-authenticate the user through the cookie on all subsequent requests.


    In the first request below, we are presumably logging into a website (with parameter input that mimics a form submission); in the next request, we are using the cookies returned from the first request to authenticate.

    import requests
    
    login = {"username": "david", "password": "jdoe123"}
    resp = requests.post('http://www.mywebsite.com/login', data=login)
    
    r2 = requests.post('http://www.mywebsite.com/newpage', cookies=resp.cookies)

    However, note that many websites use sophisticated authentication tokens or other methods to discourage logins by automated systems; the 'Captcha' system usually requires that a human being evaluate a photo or sound to prove that he/she is not a bot (i.e., a script).


    requests: other features

    requests is popular because of its ease of use and complement of features.



    Python on the client side: the urllib module

    urllib is an alternative to requests for making web requests. It comes installed with Python.


    Although the requests module is strongly favored by some for its simplicity, it has not yet been added to the Python standard library.


    The urlopen() function takes a URL and returns a file-like object that can be read() like a file:

    import urllib.request
    my_url = 'http://www.google.com'
    readobj = urllib.request.urlopen(my_url)  # return a 'file-like' object
    text = readobj.read()                     # read into a 'byte string'
    # text = text.decode('utf-8')             # optional, sometimes required:
                                              # decode as a 'str' (see below)
    readobj.close()

    Alternatively, you can call readlines() on the object (keep in mind that many objects that can deliver file-like string output can be read with this same-named method):

    for line in readobj.readlines():
      print(line)
    readobj.close()

    The text that is downloaded may be CSV, HTML, JavaScript, or other kinds of data.

    POTENTIAL ERRORS AND REMEDIES WITH urllib


    TypeError mentioning 'bytes' -- sample exception messages:

    TypeError: can't use a string pattern on a bytes-like object
    TypeError: must be str, not bytes
    TypeError: can't concat bytes to str

    These errors indicate that you tried to use a bytes object where a str is appropriate (or vice versa).


    The urlopen() response comes to us as a bytes object (a "byte string"). In order to work with the response as a str, we can use the decode() method to convert it, specifying an encoding.

    text = text.decode('utf-8')

    'utf-8' is the most common encoding, although others ('ascii', 'utf-16', 'utf-32' and more) may be required.
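    The round trip between bytes and str can be demonstrated without any web request at all:

```python
raw = 'Montréal'.encode('utf-8')   # simulate the bytes returned by urlopen()
print(type(raw))                   # <class 'bytes'>

text = raw.decode('utf-8')         # decode to str using the utf-8 encoding
print(type(text))                  # <class 'str'>
print(text)                        # Montréal

# mixing the two types raises the kind of error shown above
try:
    combined = raw + text
except TypeError as e:
    print(e)
```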


    I have found that we do not always need to convert (depending on what you will be doing with the returned string), which is why I commented out the line in the first example.

    SSL Certificate Error

    Many websites enable SSL security and require a web request to accept and validate an SSL certificate (certifying the identity of the server). urllib requires SSL certificate validation by default, but it can be bypassed (keep in mind that this may be a security risk).


    import ssl
    import urllib.request
    
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    
    my_url = 'http://www.nytimes.com'
    readobj = urllib.request.urlopen(my_url, context=ctx)

    Download binary files: images and other files can be saved locally using urllib.request.urlretrieve().


    import urllib.request
    
    urllib.request.urlretrieve('http://www.azquotes.com/picture-quotes/quote-python-is-an-experiment-in-how-much-freedom-programmers-need-too-much-freedom-and-nobody-guido-van-rossum-133-51-31.jpg', 'guido.jpg')

    Note the two arguments to urlretrieve(): the first is a URL to an image, and the second is a filename -- this file will be saved locally under that name.


    Encoding Parameters: urllib.parse.urlencode()

    When including parameters in our requests, we must encode them into our request URL. The urlencode() function does this nicely:


    import urllib.request, urllib.parse
    
    params = urllib.parse.urlencode({'choice1': 'spam and eggs', 'choice2': 'spam, spam, bacon and spam'})
    print("encoded query string: ", params)
    f = urllib.request.urlopen("http://www.google.com?{}".format(params))
    print(f.read())

    this prints:

    encoded query string: choice1=spam+and+eggs&choice2=spam%2C+spam%2C+bacon+and+spam
    
    choice1:  spam and eggs<BR>
    choice2:  spam, spam, bacon and spam<BR>

    Web Scraping with Beautiful Soup (bs4)

    Beautiful Soup parses XML or HTML documents, making text and attribute extraction a snap.


    Here we are passing the text of a web page (obtained by requests) to the BS parser:

    from bs4 import BeautifulSoup
    import requests
    
    response = requests.get('http://www.nytimes.com')
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # show HTML in "pretty" form
    print(soup.prettify())
    
    # show all plain text in a page
    print(soup.get_text())

    The result is a BeautifulSoup object which we can use to search for tags and data.


    For the following examples, let's use the HTML provided on the Beautiful Soup Quick Start page:

    <!doctype html>
    <html>
      <head>
        <title>The Dormouse's story</title>
      </head>
      <body>
        <p class="title"><b>The Dormouse's story</b></p>
    
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister eldest" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.</p>
    
        <p class="story">They were happy, and eventually died.  The End.</p>
      </body>
    </html>

    BeautifulSoup documentation can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/


    bs4: the Tag object

    Tags' attributes and contents can be read; they can also be queried for tags and text within the tag


    body_text = """
        <BODY class="someclass otherclass">
            <H1 id='mytitle'>This is a heading</H1>
            <A href="mysite.com">This is a link</A>
        </BODY>
    """

    An HTML tag can contain four types of data:

      1. The tag's name ('BODY', 'H1', 'A', etc.)
      2. The tag's attributes (<BODY class=, <H1 id= or <A href=)
      3. The tag's text ('This is a heading' or 'This is a link')
      4. The tag's contents (i.e., tags within it -- for <BODY>, the <H1> and <A> tags)


    from bs4 import BeautifulSoup
    soup = BeautifulSoup(body_text, 'html.parser')
    
    h1 = soup.body.h1  # h1 is a Tag object
    h1.name            # name of the tag:         'h1'
    h1.text            # text of the tag:         'This is a heading'
    h1.get('id')       # 'id' attribute value:    'mytitle'
    h1['id']           # same:                    'mytitle'
    h1.attrs           # all attrs as a dict:     {'id': 'mytitle'}
    
    body = soup.body   # body is a Tag object
    body.name          # name of the tag:         'body'
    body.text          # text of the tag:         '\nThis is a heading\nThis is a link\n'
    body.get('class')  # 'class' attribute value: ['someclass', 'otherclass']
    body['class']      # same:                    ['someclass', 'otherclass']
    body.attrs         # all attrs as a dict:     {'class': ['someclass', 'otherclass']}

    A tag's child tags can be searched in the same way as the BeautifulSoup object

    body = soup.body         # find the <body> tag in this document
    
    atag = body.find('a')    # find first <a> tag in this <body> tag

    bs4: finding the first tag by name with attribute or find()

    The soup object (or any tag) can be searched for the first instance of a tag



    Finding the first tag by name using soup.attribute


    The BeautifulSoup object's attributes can be used to search for a tag. The first tag with that name will be returned.

    # first (and only) <title> tag
    print(soup.title)              # <title>The Dormouse's story</title>
    
    # first (of several) <p> tags
    print(soup.p)                  # <p class="title"><b>The Dormouse's story</b></p>

    Attributes can be chained to drill down to a particular tag:

    print(soup.body.p.b)           # <b>The Dormouse's story</b>

    However, keep in mind that each attribute returns only the first tag of that name found.



    Finding the first tag by name: find()


    find() works similarly to an attribute, but filters can be applied (discussed shortly).

    print(soup.find('a'))         # <a class="sister eldest" href="http://example.com/elsie" id="link1">Elsie</a>

    bs4: find_all()

    find_all() retrieves a list of all tags with a particular name.


    Beautiful Soup can find all tags with a particular name. The result is a special container object that can be used like a list (subscripting, looping) to retrieve the tags:

    tags = soup.find_all('a')
    
    tag = tags[0]
    print(type(tag))          # <class 'bs4.element.Tag'>
    
    for tag in tags:
        print(tag['href'])    # http://example.com/elsie, etc.

    bs4: find() and find_all() using tag name, attribute and content criteria

    Tag criteria can focus on a tag's name, its attributes, or text within the tag.



    SEARCHING NAME, ATTRIBUTE OR TEXT


    Finding a tag by name: links in a page are marked with the <A> tag (usually seen as <A HREF="">). This call pulls out all links from a page:

    link_tags = soup.find_all('a')



    Finding a tag by tag attribute and/or name and tag attribute

    # all <a> tags with an 'id' attribute of link1
    link1_a_tags = soup.find_all('a', id="link1")
    
    link1_a_tags = soup.find_all('a', {'id': "link1"})   # alt. syntax (required for 'class' or
                                                         #              'name' attributes)
    
    # all tags (of any name) with an 'id' attribute of link1
    link1_tags = soup.find_all(id="link1")



    "multi-value" tag attribute: CSS allows multiple values in an attribute:

    <a href="http://example.com/elsie" class="sister eldest" id="link1">Elsie</a>

    If we'd like to find a tag through one of these values, we can search on that value -- a tag matches if any of its classes matches:

    elsie_tag = soup.find('a', {'class': 'eldest'})



    Finding a tag by string within the tag's text

    import re
    elsie_tags = soup.find_all('a', text=re.compile('Elsie'))

    All <a> tags whose text contains 'Elsie'. (Note that text='Elsie' alone matches only the exact, complete string; a compiled regex allows substring matching.)




    Special note on "class" and "name" attributes: because these names are used by Python and/or BeautifulSoup, they cannot be used as keyword arguments:

    sister_tags = soup.find_all('a', class="sister")       # SyntaxError ('class' is misinterpreted)
    
    sister_tags = soup.find_all('a', {'class': 'sister'})  # correct


    FILTERING TYPES: STRING, LIST, REGEXP, FUNCTION

    string: filter on the tag's name


    tags = soup.find_all('a')          # return a list of all <a> tags


    list: filter on tag names


    tags = soup.find_all(['a', 'b'])   # return a list of all <a> or <b> tags


    regexp: filter on pattern match against name


    import re
    tags = soup.find_all(re.compile('^b'))      # a list of all tags whose names start with 'b'

    re.compile() produces a pattern object that Beautiful Soup applies to tag names with the pattern's search() method



    function: filter if function returns True


    soup.find_all(lambda tag: tag.name == 'a' and 'mysite.com' in tag.get('href', ''))   # default '' guards tags with no href

    bs4: finding a 'sibling' with .next_sibling and .previous_sibling

    Same-named tags on the same level may be navigated "across" rather than "throughout"


    You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:

    sibling_soup.b.next_sibling         # text2
    sibling_soup.c.previous_sibling     # text1


    Sidebar: Sending email with smtplib

    Sending mail is simple when you have an SMTP server running and available on your host computer. Python's smtplib module makes this easy:


    #!/usr/bin/env python
    
    # Import smtplib for the actual sending function
    import smtplib
    
    # Import the email modules we'll need
    from email.mime.text import MIMEText
    
    # Create a text/plain message formatted for email
    msg = MIMEText('Hello, email.')
    
    from_address = 'dbb212@nyu.edu'
    to_address = 'david.beddoe@gmail.com'
    subject = 'Test message from a Python script'
    
    msg['Subject'] = subject
    msg['From'] =    from_address
    msg['To'] =      to_address
    
    s = smtplib.SMTP('localhost')
    s.sendmail(from_address, [to_address], msg.as_string())
    s.quit()
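    The message object can be inspected before sending; as_string() renders the headers and body exactly as they are passed to sendmail() (the addresses are the sample ones from the script above):

```python
from email.mime.text import MIMEText

msg = MIMEText('Hello, email.')
msg['Subject'] = 'Test message from a Python script'
msg['From'] = 'dbb212@nyu.edu'
msg['To'] = 'david.beddoe@gmail.com'

rendered = msg.as_string()   # the full message text sent over SMTP
print('Subject: Test message from a Python script' in rendered)   # True
print('Hello, email.' in rendered)                                # True
```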



    Regular Expressions: Matching

    Regular Expressions: Introduction

    Regular Expressions, or "regexes", are a declarative language used to match patterns in text; they are used for text validation/inspection and text extraction. Previously, we have had limited tools for inspecting text. In the case of fixed-width text, we have been able to use a slice.


    line = '19340903  3.4 0.9'
    year = line[0:4]                 # year == 1934

    In the case of delimited text, we have been able to use split()


    line = '19340903,3.4,0.9'
    els = line.split(',')
    
    mkt_rf = els[1]                   # '3.4'

    In the case of formatted text, there is no obvious way to do it.


    Regular Expressions: Preview

    This regex pattern matches the elements of the following string, which has a recognizable though non-uniform format:


    import re
    
    dates_str = 'Nov 27-Dec 1: 10am-9pm'
    
    reg = re.search(r'^(\w\w\w)\s(\d{1,2})\-(\w\w\w)\s(\d{1,2}):\s(\d{1,2})am\-(\d{1,2})pm$', dates_str)
    
    print((reg.groups()))

    Reading from left to right, the pattern (shown in the r'' string) says this:

      3 'word' characters,  followed by
      a space,              followed by
      1-2 digit characters, followed by
      a dash,               followed by
      3 'word' characters,  followed by
      a space,              followed by
      1-2 digit characters, followed by
      a colon and a space,  followed by
      1-2 digit characters, followed by
      'am',                 followed by
      a dash,               followed by
      1-2 digit characters, followed by
      'pm'
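    Running the search confirms each parenthesized piece, captured in order:

```python
import re

dates_str = 'Nov 27-Dec 1: 10am-9pm'
m = re.search(r'^(\w\w\w)\s(\d{1,2})\-(\w\w\w)\s(\d{1,2}):\s(\d{1,2})am\-(\d{1,2})pm$',
              dates_str)

print(m.groups())    # ('Nov', '27', 'Dec', '1', '10', '9')
```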

    The parentheses identify text to be extracted. You can see the text that was extracted in the output. The below regex pattern matches elements of a web server log line in a single statement, without resorting to complicated splitting, slicing or string inspection methods. The parenthesized portions of the pattern mark the log line elements that we wish to extract.


    import re
    
    log_line = '66.108.19.165 - - [09/Jun/2003:19:56:33 -0400] "GET /~jjk265/cd.jpg HTTP/1.1" 200 175449'
    
    reg = re.search(r'(\d{2,3}\.\d{2,3}\.\d{2,3}\.\d{2,3}) - - \[(\d\d\/\w{3}\/\d{4}):(\d\d:\d\d:\d\d) (\-?\d\d\d\d)', log_line)
    print(type(reg))
    
    print(reg.group(1))   # 66.108.19.165
    print(reg.group(2))   # 09/Jun/2003
    print(reg.group(3))   # 19:56:33
    print(reg.group(4))   # -0400

    Patterns look for various classes of text (for example, numbers or letters), as well as literal characters, in various quantities, reading through consecutive characters of the text. Reading from left to right, the pattern (shown in the r'' string) says this:


      2-3 digits, a period, 2-3 digits, a period, 2-3 digits, a period, 2-3 digits,
      followed by a space, dash, space, dash,
      followed by an open square bracket, 2 digits, forward slash, 3 word characters, forward slash, 4 digits,
      followed by a colon, 2 digits, colon, 2 digits, colon, 2 digits, space
      followed by a dash and 4 digits.

    The parentheses identify text to be extracted. You can see the text that was extracted in the output.


    The re module and re.search() function

    re.search() returns a match object if the pattern matches the text, or None if it does not.


    import re          # import the regex library
    
    if re.search(r'~jjk265', line):
        print(line)                         # prints any line with the characters ~jjk265

    re.search() takes two arguments: the string pattern, and the string to be searched. Normally used in an if expression, its return value evaluates as True if the pattern matched.
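    The return value is what makes the if test work: a match object (truthy) on success, None (falsy) on failure:

```python
import re

line = 'GET /~jjk265/cd.jpg HTTP/1.1'

m = re.search(r'~jjk265', line)       # matches: m is a match object
print(bool(m))                        # True

m2 = re.search(r'~nomatch', line)     # no match: m2 is None
print(m2)                             # None
```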


    # weblog contains string lines like this:
      '66.108.19.165 - - [09/Jun/2003:19:56:33 -0400] "GET /~jjk265/cd.jpg HTTP/1.1" 200 175449'
      '66.108.19.165 - - [09/Jun/2003:19:56:44 -0400] "GET /~dbb212/mysong.mp3 HTTP/1.1" 200 175449'
      '66.108.19.165 - - [09/Jun/2003:19:56:45 -0400] "GET /~jjk265/cd2.jpg HTTP/1.1" 200 175449'
    
    # script snippet:
    for line in weblog.readlines():
      if re.search(r'~jjk265', line):
        print(line)                     # prints 2 of the above lines

    re.findall() for multiple match extraction

    Usually re tries to match a pattern once -- after it finds the first match, it quits searching. But we may want to find as many matches as we can -- and return the entire set of matches in a list. findall() lets us do that:


    import re
    
    text = 'one number:  203-291-2921; another number:  212-266-2327'
    numbers = re.findall(r'\d\d\d\-\d\d\d\-\d\d\d\d', text)
    print(numbers)        # ['203-291-2921', '212-266-2327']

    The raw string (r'')

    The raw string is like a normal string, but it does not process escapes. An escaped character is one preceded by a backslash, which turns the combination into a special character. \n is the one we're familiar with -- the escaped n converts to a newline character, which marks the end of a line in a multi-line string. A raw string doesn't process the escape, so r'\n' is literally a backslash followed by an n.


    var = "\n"            # one character, a newline
    var2 = r'\n'          # two characters, a backslash followed by an n

    "not" for negating a search

    not is used to negate a search: "if the pattern does not match". Search file:


      '66.108.19.165 - - [09/Jun/2003:19:56:33 -0400] "GET /~jjk265/cd.jpg HTTP/1.1" 200 175449'
      '66.108.19.165 - - [09/Jun/2003:19:56:44 -0400] "GET /~dbb212/mysong.mp3 HTTP/1.1" 200 175449'
      '66.108.19.165 - - [09/Jun/2003:19:56:45 -0400] "GET /~jjk265/cd2.jpg HTTP/1.1" 200 175449'

    code


    for line in weblog.readlines():
        if not re.search(r'~jjk265', line):
            print(line)                      # prints 1 of the above lines -- the one without jjk265

    The 're' Vocabulary

    These terms, used in combination, cover most needs in composing text matching patterns.

    Anchor Characters and the Boundary Character
    (require that the match occur at the start or end of the string, or at a word boundary)
    $, ^, \b
    Character Classes
    (match on any of a collection of characters)
    \w, \d, \s, \W, \S, \D
    Custom Character Classes
    (user-specified character class)
    [aeiou], [a-zA-Z]
    The Wildcard
    (matches on any character but newline)
    .
    Quantifiers
    (specifies number of characters applied to immediately preceding character or class)
    +, *, ?
    Custom Quantifiers
    (user-specified quantifier)
    {2,3}, {2,}, {2}
    Groupings
    (designate part of a match for extraction, quantifying or vertical bar alternates)
    (parentheses groups)
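    As a small sketch, here is a pattern combining several of the vocabulary items above -- anchors, a custom character class, a quantifier, a built-in class and a custom quantifier (the pattern itself is illustrative, not from the slides):

    ```python
    import re

    # ^ anchor, custom class [A-Za-z] with + quantifier,
    # built-in class \d with custom quantifier {4}, $ anchor
    pattern = r'^[A-Za-z]+ \d{4}$'     # e.g. a month name and a 4-digit year

    print(bool(re.search(pattern, 'June 2003')))   # True
    print(bool(re.search(pattern, 'June 03')))     # False -- only 2 digits
    ```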


    Patterns can match anywhere, but must match on consecutive characters

    Read the string from left to right, looking for the first place the pattern matches on consecutive characters.


    import re
    
    str1 = 'hello there'
    str2 = 'why hello there'
    str3 = 'hel lo'
    
    if re.search(r'hello', str1):  print('matched')   # matched
    if re.search(r'hello', str2):  print('matched')   # matched
    if re.search(r'hello', str3):  print('matched')   # does not match

    Note that 'hello' matches at the start of the first string and the middle of the second string. But it doesn't match in the third string, even though all the characters we are looking for are there. This is because the space in str3 is unaccounted for - always remember - matches take place on consecutive characters.


    Anchors and Boundary

    The anchor characters ^ and $ require that the pattern match start at the beginning of the string or end at the end of the string. This program lists only those files in the directory that end in '.txt':


    import os, re
    for filename in os.listdir(r'/path/to/directory'):  # ['this.txt', 'that.jpg', etc.]
        if re.search(r'\.txt$', filename):       # match on '.txt' at end of filename
            print(filename)

    This program prints all the lines in the file that start with Tel:


    for text_line in ['AURORA MALL',
                      'OPEN10:00 AM - 9:00 PM',
                      '14200 E ALAMEDA AVE AURORA, CO 80012',
                      'Tel: (303) 344-9901']:
        if re.search(r'^Tel: ', text_line):    # match on 'Tel: ' at start of line
            print(text_line)

    When they are used as anchors, we will always expect ^ to appear at the start of the pattern, and $ to appear at the end.


    Character Classes

    A character class is a special pattern entity that can match on any of a group of characters: any of the "digit" class (0-9), any of the "word" class (letters, numbers and underscore), etc.


    user_input = input('please enter a single-digit integer: ')
    if not re.search(r'^\d$', user_input):
        exit('bad input:  exiting...')

    class    members         description
    \d       [0-9]           digits
    \w       [a-zA-Z0-9_]    word characters -- letters, numbers or underscores
    \s       [ \n\t]         'whitespace' characters -- spaces, newlines, or tabs


    Built-in Character Class: digits

    The \d character class matches on any digit. This example lists only those files with names formatted with a particular syntax -- YYYY-MM-DD.txt:


    import re
    dirlist = ('.', '..', '2010-12-15.txt', '2010-12-16.txt', 'testfile.txt')
    for filename in dirlist:
        if re.search(r'^\d\d\d\d-\d\d-\d\d\.txt$', filename):
            print(filename)

    This example looks for the phone number in the line:

    line = 'STORE: (951) 296-5558'
    
    if re.search(r'\(\d\d\d\) \d\d\d\-\d\d\d\d', line):
        print('phone number found')

    Here's another example, looking for a line that ends in what looks like a zip:

    import re
    lines = ['Nordstrom West', '40640 WINCHESTER RD', 'TEMECULA, CA 92591']
    for line in lines:
        if re.search(r'\d\d\d\d\d$', line):
            print('city, state: ', line)

    Built-in Character Class: "word" characters

    The \w character class matches on any number, letter or underscore.


    This class is often misapprehended -- it doesn't just match on letters. The name "word" most likely refers to a "word" from the perspective of a Unix systems administrator or programmer.


    In this example, we require the user to enter a username of exactly five "word" characters:

    username = input()
    if not re.search(r'^\w\w\w\w\w$', username):
        print("use five numbers, letters, or underscores\n")

    As you can see, the anchors force the match to start at the start of the string and end at the end of the string -- thus matching only on the whole string.


    Built-in Character Classes: "space" characters

    The \s character class matches on a space, a newline (\n) or a tab (\t).


    This program looks for a state and zip at the end of the line; but note the use of \s -- it's been placed in every place we might find a space:

    line = 'TEMECULA, CA 92591'
    
    if re.search(r',\s\w\w\s\d\d\d\d\d$', line):
        print('looks like a state and zip')

    This program searches for a space anywhere in the string and if it finds it, the match is successful - which means the input isn't successful:

    new_password = input()
    if re.search(r'\s', new_password):
        print("password must not contain spaces")

    Note in particular that the regex pattern \s is not anchored anywhere. So the regex will match if a space occurs anywhere in the string.


    You may also reflect that we treat spaces pretty roughly - always stripping them off. They're always causing problems! And they're invisible, too, and still get in the way. What a nuisance.


    Inverse Character Classes

    Each built-in class has a corresponding "uppercased" character class that matches on anything but the original class.

    Not a digit: \D

    \D matches on any character that is not a digit, including letters, punctuation, spaces, etc. This program checks for a non-digit in the user's account number:


    account_number = input()
    if re.search(r'\D', account_number):
        print("account number must be all digits!")

    Not a word character: \W

    \W matches on any character that is not a word character, including punctuation and spaces.


    account_number = input()
    if re.search(r'\W', account_number):
        print("account number must be only letters, numbers, and underscores")

    Not a space character: \S

    \S matches on any character that is not a whitespace character (i.e. letters, numbers, punctuation, etc.). These two regexes check for a non-space at the start and end of the string:


    sentence = input()
    if re.search(r'^\S', sentence) and re.search(r'\S$', sentence):
        print("the sentence does not begin or end with a space, tab or newline.")

    Custom Character Classes

    The collection of characters included in a character class can be custom-defined. Consider this table of character classes and the list of characters they match on:

    class    members
    \d       [0123456789]
    \w       [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_] or [a-zA-Z0-9_]
    \s       [ \t\n]
    In fact, the bracketed ranges can be used to create our own character classes. We simply place members of the class within the brackets and use it in the same way we might use \d or the others.


    A custom class can contain a range of characters. This example looks at the first letter of a word to determine if it is a vowel:

    word = input('please enter an object: ')
    
    if re.search(r'^[aeiou]', word):
        article = 'an'
    else:
        article = 'a'
    print('Here is {} {}'.format(article, word))

    This example looks for letters only (there is no built-in class for letters):


    import re
    username = input("please enter a username, starting with a letter:  ")
    if not re.search(r'^[a-zA-Z]', username):
        exit("invalid user name entered")

    This custom class [.,;:?!] matches on any one of these punctuation characters, and this example identifies single punctuation characters and removes them:


    import re
    text_line = 'Will I?  I will.  Today, tomorrow; yesterday and before that.'
    for word in text_line.split():
        word = re.sub(r'[.,;:?!]$', '', word)
        print(word)

    Negative Custom Character Classes

    Any custom character class can be "inverted" when preceded by a caret character. Like \S for \s, the inverse character class matches on anything not in the list. It is designated with a caret just inside the open bracket:


    This program rejects a filename that starts with a non-letter:

    ufname = input('please enter a filename: ')
    if re.search(r'^[^A-Za-z]', ufname):
        exit('filename must start with a letter')

    This program loops through and "cleans" non-letters at the end of each word:

    import re
    
    for text_line in open('unknown_text.txt'):
        for word in text_line.split():
            while re.search(r'[^a-zA-Z]$', word):
                word = word[:-1]
            print(word)

    The while loop says "as long as you see a non-letter character at the end, slice the word so that character is excluded".


    It would be easy to confuse the caret at the start of a pattern with the caret at the start of a custom character class -- just keep in mind that one appears at the very start of the pattern, and the other at the start of the bracketed list.
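    A tiny sketch of the contrast (the example strings are illustrative):

    ```python
    import re

    # caret at the start of the pattern: an anchor ("starts with a digit")
    print(bool(re.search(r'^\d', '5 dogs')))      # True

    # caret just inside the brackets: negation ("has a non-digit somewhere")
    print(bool(re.search(r'[^0-9]', '5 dogs')))   # True -- space and letters
    print(bool(re.search(r'[^0-9]', '555')))      # False -- all digits
    ```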


    The Wildcard (.)

    The wildcard matches on any character that is not a newline.


    import re
    username = input()
    if not re.match(r'^.....$', username):   # five dots here
        print("you can use any characters except newline, "
              "but there must be five of them.")

    We might surmise this is because we are often working with line-oriented input, with pesky newlines at the end of every line. Not matching on them means we never have to worry about stripping or watching out for newlines.


    Quantifiers: specifies how many to look for

    The quantifier specifies how many of the immediately preceding character or class the pattern should match. We can say three digits (\d{3}), between 1 and 3 word characters (\w{1,3}), one or more letters ([a-zA-Z]+), zero or more spaces (\s*), one or more x's (x+). Anything that matches on a character can be quantified.

    +         1 or more
    *         0 or more
    ?         0 or 1
    {3,10}    between 3 and 10


    In this example directory listing, we are interested only in files with the pattern config_ followed by an integer of any size. We know that there could be a config_1.txt, a config_12.txt, or a config_120.txt. So, we simply specify "one or more digits":

    import re
    filenames = ['config_1.txt', 'config_10.txt', 'notthis.txt', '.', '..']
    wanted_files = []
    for file in filenames:
        if re.search(r'^config_\d+\.txt$', file):
            wanted_files.append(file)

    Here, we validate user input to make sure it matches the pattern for valid NYU ID. The pattern for an NYU Net ID is: two or three letters followed by one or more numbers:


    import re
    uin = input("please enter your net id:  ")
    if not re.search(r'^[A-Za-z]{2,3}\d+$', uin):
        print("that is not a valid NYU Net ID!")

    We could also pull all net ids from a text using findall():


    import re
    ids = re.findall(r'[A-Za-z]{2,3}\d+', text)

    A simple email address is one or more word characters, followed by an @ sign, one or more word characters, a period, and 2 or more letters:


    import re
    email_address = input()
    if re.search(r'^\w+@\w+\.[A-Za-z]{2,}$', email_address):
        print("email address validated")

    Of course email addresses can be more complicated than this - but for this exercise it works well.


    re.search(), re.compile() and the compile object

    re.search() is the one-step method we've been using to test matching. Actually, regex matching is done in two steps: compiling and searching. re.search() conveniently puts the two together. In some cases, a pattern should be compiled first before matching begins. This would be useful if the pattern is to be matched on a great number of strings, as in this weblog example:


    import re
    access_log = '/home1/d/dbb212/public_html/python/examples/access_log'
    weblog = open(access_log)
    patternobj = re.compile(r'edg205')
    for line in weblog.readlines():
      if patternobj.search(line):
        print(line, end=' ')
    weblog.close()

    The pattern object is returned from re.compile, and can then be called with search. Here we're calling search repeatedly, so it is likely more efficient to compile once and then search with the compiled object.
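    A compiled pattern object is not limited to search() -- it offers the same matching methods as the re module itself, including findall() and sub(). A small sketch (the sample strings are illustrative):

    ```python
    import re

    patternobj = re.compile(r'\d+')                # compile once...

    # ...then call any matching method on the compiled object
    print(patternobj.findall('10 cats, 20 dogs'))  # ['10', '20']
    print(patternobj.sub('N', '10 cats, 20 dogs')) # N cats, N dogs
    ```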


    Grouping for Alternates: Vertical Bar

    We can group several characters together with parentheses. The parentheses do not affect the match, but they do designate a part of the matched string to be handled later. We do this to allow for alternate matches, for quantifying a portion of the pattern, or to extract text.


    Inside a group, the vertical bar can indicate allowable matches. In this example, a string will match on any of these words, and because of the anchors will not allow any other characters:

    r'^(y|yes|yeah|yep|yup|yu-huh)$'   # matches any of these words

    We may be interested in only email addresses that belong to a particular domain:

    addrs = re.findall(r'\w+\@\w+\.(com|org|edu|gov)', text)

    (Note that because the pattern contains a group, findall() returns just the grouped text -- here, the matched domains rather than the full addresses.)

    Grouping for Quantifying

    Another reason to group would be to quantify a sequence of the pattern. For example, we could search for the "John Rockefeller or John D. Rockefeller" pattern by making the 'D.' optional:


    r'John\s+(D\.\s+)?Rockefeller'
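    A minimal sketch of this optional group in action (the names are illustrative):

    ```python
    import re

    # the (D\.\s+)? group makes the middle initial optional
    pattern = r'John\s+(D\.\s+)?Rockefeller'

    for name in ['John Rockefeller', 'John D. Rockefeller', 'Jane Rockefeller']:
        if re.search(pattern, name):
            print('matched:', name)    # matches the first two names only
    ```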

    Grouping for Extraction: Memory Variables

    We use the group() method of the match object to extract the text that matched the group:


    string1 = "Find the nyu id, like dbb212, in this sentence"
    matchobj = re.search(r'([a-z]{2,3}\d+)', string1)
    id = matchobj.group(1)                          # id is 'dbb212'
    

    Here's an example, using our log file. What if we wanted to capture the last two numbers (the status code and the number of bytes served), and place the values into structures?


    log_lines = [
    '66.108.19.165 - - [09/Jun/2003:19:56:33 -0400] "GET /~jjk265/cd.jpg HTTP/1.1" 200 175449',
    '216.39.48.10 - - [09/Jun/2003:19:57:00 -0400] "GET /~rba203/about.html HTTP/1.1" 200 1566',
    '216.39.48.10 - - [09/Jun/2003:19:57:16 -0400] "GET /~dd595/frame.htm HTTP/1.1" 400 1144'
    ]
    
    import re
    bytes_sum = 0
    for line in log_lines:
      matchobj = re.search(r'(\d+) (\d+)$', line) # last two numbers in line
      status_code = matchobj.group(1)
      bytes = matchobj.group(2)
      bytes_sum += int(bytes)                     # sum the bytes

    groups()

    If you wish to grab all the matches into a tuple rather than call them by number, use groups(). You can then read variables from the tuple, or assign groups() to named variables, as in the last line below:


    import re
    
    name = "Richard M. Nixon"
    matchobj = re.search(r'(\w+)\s+(\w)\.\s+(\w+)', name)
    name_tuple = matchobj.groups()
    print(name_tuple)    # ('Richard', 'M', 'Nixon')
    
    (first, middle, last) = matchobj.groups()
    print("%s, %s, %s" % ( first, middle, last ))

    findall() for multiple matches

    findall() with a groupless pattern

    Usually re tries to match a pattern once -- after it finds the first match, it quits searching. But we may want to find as many matches as we can -- and return the entire set of matches in a list. findall() lets us do that:


    text = "There are seven words in this sentence";
    words = re.findall(r'\w+', text)
    print(words)  # ['There', 'are', 'seven', 'words', 'in', 'this', 'sentence']

    This program prints each of the words on a separate line. The pattern \w+ is applied again and again, each time to the text remaining after the last match. This pattern could be used as a word counting algorithm (we would count the elements in words), except for words with punctuation.

    findall() with groups

    When a match pattern contains more than one grouping, findall returns a list of tuples:


    text = "High: 33, low: 17"
    temp_tuples = re.findall(r'(\w+):\s+(\d+)', text)
    print(temp_tuples)                       # [('High', '33'), ('low', '17')]

    re.sub() for substitutions

    Regular expressions are used for matching so that we may inspect text. But they can also be used for substitutions, meaning that they have the power to modify text as well. This example replaces Microsoft '\r\n' line ending codes with Unix '\n'.


    text = re.sub(r'\r\n', '\n', text)

    Here's another simple example:


    string = "My name is David"
    string = re.sub('David', 'John', string)
    
    print(string)                            # 'My name is John'
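    re.sub() can also reuse grouped text in the replacement string, via the backreferences \1, \2, and so on. Here is a small sketch (the sample name is illustrative) that flips a "Last, First" name:

    ```python
    import re

    name = 'Nixon, Richard'

    # \1 and \2 in the replacement refer to the two parenthesized groups
    flipped = re.sub(r'(\w+),\s+(\w+)', r'\2 \1', name)
    print(flipped)                     # Richard Nixon
    ```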

    re.split() to split on a pattern

    re.split() allows us to split a string on a pattern rather than on a fixed string.


    The user enters a comma-separated list, but we don't know if they added spaces or not:

    ui = '23, 14, 7,3,9'
    numbers = re.split(r',\s*', ui)
    print(numbers)            # ['23', '14', '7', '3', '9']

    matching on multi-line files

    This example opens and reads a web page (which we might have retrieved with urllib.request's urlopen() function), then looks to see if the word "advisory" appears in the text (the re.I flag makes the match case-insensitive). If it does, it prints the page:


    file = open('weather-ny.html')
    text = file.read()
    if re.search(r'advisory', text, re.I):
      print("weather advisory:  ", text)
    

    re.MULTILINE: ^ and $ can match at start or end of line

    We have been working with text files primarily in a line-oriented (or, in database terminology, record-oriented) way, and regexes are no exception -- most file data is oriented this way. However, it can be useful to dispense with looping and use regexes to match within an entire file, read into a string variable with read(). In this example, we certainly could use a loop and split() to get the info we want. But with a regex we can grab it straight from the file in one line:


    # passwd file:
    nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
    root:*:0:0:System Administrator:/var/root:/bin/sh
    daemon:*:1:1:System Services:/var/root:/usr/bin/false
    
    # python script:
    import re
    passwd_text = open('/etc/passwd').read()
    mobj = re.search(r'^root:[^:]+:[^:]+:[^:]+:([^:]+):([^:]+)', passwd_text, re.MULTILINE)
    if mobj:
        info = mobj.groups()
        print("root:  Name %s, Home Dir %s" % (info[0], info[1]))

    We can even use findall to extract all the information from a file -- keep in mind, this is still being done in two lines:


    import re
    passwd_text = open('/etc/passwd').read()
    lot = re.findall(r'^(\w+):[^:]+:[^:]+:[^:]+:[^:]+:([^:]+)', passwd_text, re.MULTILINE)
    
    mydict = dict(lot)
    
    print(mydict)

    re.DOTALL -- allow the wildcard (.) to match on newlines

    Normally, the wildcard doesn't match on newlines. When working with whole files, we may want to grab text that spans multiple lines, using a wildcard.


    # search file sample.txt
    some text we don't want
    ==start text==
    this is some text that we do want.
    the extracted text should continue,
    including just about any character,
    until we get to
    ==end text==
    other text we don't want
    
    # python script:
    import re
    text = open('sample.txt').read()
    matchobj = re.search(r'==start text==(.+)==end text==', text, re.DOTALL)
    print(matchobj.group(1))

    flags: re.IGNORECASE

    We can modify our matches with qualifiers called flags. The re.IGNORECASE flag will match any letters, whether upper or lowercase. In this example, extensions may be upper or lowercase - this file matcher doesn't care!


    import re
    dirlist = ('thisfile.jpg', 'thatfile.txt', 'otherfile.mpg', 'myfile.TXT')
    for file in dirlist:
      if re.search(r'\.txt$', file, re.IGNORECASE):   #'.txt' or '.TXT'
        print(file)

    The flag is passed as the third argument to search, and can also be passed to other re search methods.




    Benchmarking and Efficiency

    Efficiency: Introduction

    Runtime efficiency refers to two things: memory efficiency (how much RAM a process uses) and time efficiency (how long execution takes). The two are often related -- it takes time to allocate memory.


    As a "scripting" language, Python is more convenient, but less efficient, than "programming" languages like C and Java:

    * Parsing, compilation and execution take place during runtime (C and Java are compiled ahead of time)
    * Memory is allocated based on anticipation of what your code will do at runtime (C in particular requires the developer to indicate what memory will be needed)
    * Python handles expanded memory requests seamlessly -- "no visible limits" (C and Java make use of "finite" resources; they do not expand indefinitely)

    Achieving runtime efficiency requires a tradeoff with development time -- we either spend more of our own (developer) time making our programs run faster and use less memory, or we spend less time developing and allow them to run slower (as Python handles memory allocation for us). Of course, just the choice of a convenient scripting language (like Python) over a more efficient programming language (like Java or C++) itself favors rapid development and ease of use over runtime efficiency: in many applications, efficiency is not a consideration because there's plenty of memory and enough time to get the job done.


    Nevertheless, advanced Python developers may be asked to increase the efficiency of their programs -- possibly because the data has grown past anticipated limits, the program's responsibilities and complexity have been extended, or an unknown inefficiency is bogging down execution. In this section we'll discuss the more efficient container structures and ways to analyze the speed of the various units in our programs.


    High-performance container datatypes and timing tools:

    * array: type-specific list (from the array module)
    * deque: "double-ended queue" (from the collections module)
    * Counter: a counting dictionary (from the collections module)
    * defaultdict: a dict with an automatic default for missing keys (from the collections module)
    * timeit: unit timer to compare time efficiency of various Python algorithms
    * cProfile: overall time profile of a Python program


    Benchmarking a Python function with timeit

    The timeit module provides a simple way to time blocks of Python code.


    We use timeit to help decide whether varying ways of accomplishing a task might make our programs more efficient. Here we compare execution time of four approaches to joining a range of integers into a very large string ("1-2-3-4-5...", etc.)


    from timeit import timeit
    
    # 'straight concatenation' approach
    def joinem():
        x = '1'
        for num in range(100):
            x = x + '-' + str(num)
        return x
    
    print(timeit('joinem()', setup='from __main__ import joinem', number=10000))
    
    # 0.457356929779             # setup= is discussed below
    
    
    # generator comprehension
    print(timeit('"-".join(str(n) for n in range(100))', number=10000))
    
    # 0.338698863983
    
    
    # list comprehension
    print(timeit('"-".join([str(n) for n in range(100)])', number=10000))
    
    # 0.323472976685
    
    
    # map() function
    print(timeit('"-".join(map(str, range(100)))', number=10000))
    
    # 0.160399913788

    Here map() appears to be fastest, probably because built-in functions are compiled in C.

    Repeating a test

    You can conveniently repeat a test multiple times with the repeat() function, which runs the same test several times and returns a list of the timings. Repetitions give you a much better idea of the time a function might take, by letting you average several runs.


    from timeit import repeat
    
    print(repeat('"-".join(map(str, range(100)))', number=10000, repeat=3))
    
    # [0.15206599235534668, 0.1909959316253662, 0.2175769805908203]
    
    
    print(repeat('"-".join([str(n) for n in range(100)])', number=10000, repeat=3))
    
    # [0.35890698432922363, 0.327725887298584, 0.3285980224609375]
    
    
    print(repeat('"-".join(map(str, range(100)))', number=10000, repeat=3))
    
    # [0.14228010177612305, 0.14016509056091309, 0.14458298683166504]

    setup= parameter for setup before a test

    Some tests make use of a variable that must be initialized before the test:


    print(timeit('x.append(5)', setup='x = []', number=10000))
    
    # 0.00238704681396

    Additionally, timeit() does not share the program's global namespace, so imports and even global variables must be imported if required by the test:


    print(timeit('x.append(5)', setup='import collections as cs; x = cs.deque()', number=10000))
    
    # 0.00115013122559

    Here we're testing a function, which as a global needs to be imported from the __main__ namespace:


    def testme(maxlim):
        return [ x*2 for x in range(maxlim) ]
    
    print(timeit('testme(5000)', setup='from __main__ import testme', number=10000))
    
    # 10.2637062073

    Keep in mind that a function tested in isolation may not return the same results as a function using a different dataset, or a function that is run as part of a larger program (that has allocated memory differently at the point of the function's execution). The cProfile module can test overall program execution.


    array

    The array is a type-specific list.


    The array container provides a list of a uniform type. An array's type must be specified at initialization. A uniform type makes an array more efficient than a list, which can contain any type.


    from array import array
    
    myarray = array('i', [1, 2])
    
    myarray.append(3)
    
    print(myarray)           # array('i', [1, 2, 3])
    
    print(myarray[-1])       # acts like a list
    for val in myarray:
        print(val)
    
    myarray.append(1.3)     # TypeError -- an 'i' array accepts only integers

    Available array types:

    Type code    C Type            Python Type       Minimum size in bytes
    'b'          signed char       int               1
    'B'          unsigned char     int               1
    'u'          Py_UNICODE        Unicode char      2
    'h'          signed short      int               2
    'H'          unsigned short    int               2
    'i'          signed int        int               2
    'I'          unsigned int      int               2
    'l'          signed long       int               4
    'L'          unsigned long     int               4
    'f'          float             float             4
    'd'          double            float             8


    Collections: deque

    A "double-ended queue" provides fast adds/removals.


    The collections module provides a variety of specialized container types. These containers behave in a manner similar to the built-in ones with which we are familiar, but with additional functionality based around enhancing convenience and efficiency. Lists are optimized for fixed-length operations, i.e., things like sorting, checking for membership, index access, etc. They are not optimized for appends, although this is of course a common use for them. A deque is designed specifically for fast adds and removals -- at the beginning or end of the sequence:


    from collections import deque
    
    x = deque([1, 2, 3])
    
    x.append(4)               # x now [1, 2, 3, 4]
    x.appendleft(0)           # x now [0, 1, 2, 3, 4]
    
    popped = x.pop()          # removes '4' from the end
    
    popped2 = x.popleft()     # removes '1' from the start

    A deque can also be sized, in which case appends will push existing elements off of the ends:


    x = deque(['a', 'b', 'c'], 3)      # maximum size:  3
    x.append(99)                       # now: deque(['b', 'c', 99])  ('a' was pushed off of the start)
    x.appendleft(0)                    # now: deque([0, 'b', 'c'])   (99 was pushed off of the end)
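    A sized deque makes a handy "last N items seen" buffer. As a sketch, here is a hypothetical log tail that keeps only the newest three entries:

    ```python
    from collections import deque

    last_three = deque(maxlen=3)           # keep only the newest 3 items
    for line_num in range(1, 8):           # pretend these are log lines 1..7
        last_three.append(line_num)        # older items fall off the left end

    print(list(last_three))                # [5, 6, 7]
    ```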

    Collections: Counter

    Counter provides a counting dictionary.


    This structure inherits from dict and is designed to allow an integer count as well as a default 0 value for new keys. So instead of doing this:


    c = {}
    if 'a' not in c:
        c['a'] = 0
    c['a'] = c['a'] + 1

    We can do this:


    from collections import Counter
    
    c = Counter()
    c['a'] = c['a'] + 1

    Counter also has related methods: elements() returns each key repeated as many times as its count, and most_common() returns a list of (key, count) tuples ordered by frequency:


    from collections import Counter
    
    c = Counter({'a': 2, 'b': 1, 'c': 3, 'd': 1})
    
    for key in c.elements():
        print(key, end=' ')            # a a b c c c d
    
    print(','.join(c.elements()))   # a,a,b,c,c,c,d
    
    
    print(c.most_common(2))   # [('c', 3), ('a', 2)]
                              # 2 arg says "give me the 2 most common"
    
    c.clear()                # remove all keys and counts

    And, you can use Counter's implementation of the math operators to work with multiple counters and have them sum their values:


    c = Counter({'a': 1, 'b': 2})
    d = Counter({'a': 10, 'b': 20})
    
    print(c + d)                     # Counter({'b': 22, 'a': 11})

    Collections: defaultdict

    defaultdict is a dict that provides a default object for new keys.


    Similar to Counter, defaultdict allows for a default value if a key doesn't exist; but it will accept a function that provides a default value.


    A defaultdict with a default list value for each key

    from collections import defaultdict
    
    ddict = defaultdict(list)
    
    ddict['a'].append(1)
    ddict['b']
    
    print(ddict)                    # defaultdict(<class 'list'>, {'a': [1], 'b': []})

    A defaultdict with a default dict value for each key

    ddict = defaultdict(dict)
    
    print(ddict['a'])         # {}    (key/value is created, assigned to 'a')
    
    print(list(ddict.keys()))       # dict_keys(['a'])
    
    ddict['a']['Z'] = 5
    ddict['b']['Z'] = 5
    ddict['b']['Y'] = 10
    
          # defaultdict(<class 'dict'>, {'a': {'Z': 5}, 'b': {'Z': 5, 'Y': 10}})
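    Another common choice is defaultdict(int), which gives missing keys a default of 0 -- convenient for counting, much like Counter (the sample sentence is illustrative):

    ```python
    from collections import defaultdict

    counts = defaultdict(int)              # missing keys default to 0
    for word in 'the cat and the hat'.split():
        counts[word] += 1                  # no need to initialize each key

    print(counts['the'])                   # 2
    print(counts['cat'])                   # 1
    ```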

    Profiling a Python program with cProfile

    The profiler runs an entire script and times each unit (call to a function).


    If a script is running slowly it can be difficult to identify the bottleneck. timeit() may not be adequate as it times functions in isolation, and not usually with "live" data. This test program (ptest.py) deliberately pauses so that some functions run slower than others:


    import time
    
    def fast():
        print("I run fast!")
    
    
    def slow():
        time.sleep(3)
        print("I run slow!")
    
    
    def medium():
        time.sleep(0.5)
        print("I run a little slowly...")
    
    
    def main():
        fast()
        slow()
        medium()
    
    if __name__ == '__main__':
        main()

    We can profile this code thusly:


    >>> import cProfile
    >>> import ptest
    >>> cProfile.run('ptest.main()')
    I run fast!
    I run slow!
    I run a little slowly...
             8 function calls in 3.500 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.500    3.500 <string>:1(<module>)
            1    0.000    0.000    0.500    0.500 ptest.py:15(medium)
            1    0.000    0.000    3.500    3.500 ptest.py:21(main)
            1    0.000    0.000    0.000    0.000 ptest.py:4(fast)
            1    0.000    0.000    3.000    3.000 ptest.py:9(slow)
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
            2    3.499    1.750    3.499    1.750 {time.sleep}

    According to these results, the slow() and main() functions are the biggest time users. The overall execution of the module itself is also shown. Comparing our code to the results, we can see that main() is slow only because it calls slow(), so we can then focus on the obvious culprit, slow(). It's also possible to insert profiling into our script around particular function calls so we can focus our analysis:


    profile = cProfile.Profile()
    profile.enable()
    main()                         # or whatever function calls we'd prefer to focus on
    profile.disable()
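    The stats gathered by a Profile object can then be read with pstats; a sketch, with a throwaway computation standing in for the function calls we'd prefer to focus on:

```python
import cProfile
import io
import pstats

profile = cProfile.Profile()
profile.enable()
sum(i * i for i in range(100000))    # stand-in for the calls under study
profile.disable()

# read the gathered stats with pstats, sorted by cumulative time
out = io.StringIO()
stats = pstats.Stats(profile, stream=out)
stats.sort_stats('cumulative').print_stats(5)    # top 5 entries
print(out.getvalue())
```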

    Command-line interface to cProfile


    python -m cProfile -o output.bin ptest.py

    The -m flag tells Python to run the named module as a script -- here it runs cProfile against ptest.py. -o directs the output to a file. The result is a binary file that can be analyzed with the pstats module, which produces largely the same output as run():


    >>> import pstats
    >>> p = pstats.Stats('output.bin')
    >>> p.strip_dirs().sort_stats(-1).print_stats()
    Thu Mar 20 18:32:16 2014    output.bin
    
             8 function calls in 3.501 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.501    3.501 ptest.py:1(<module>)
            1    0.001    0.001    0.500    0.500 ptest.py:15(medium)
            1    0.000    0.000    3.501    3.501 ptest.py:21(main)
            1    0.001    0.001    0.001    0.001 ptest.py:4(fast)
            1    0.001    0.001    3.000    3.000 ptest.py:9(slow)
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
            2    3.499    1.750    3.499    1.750 {time.sleep}
    
    
    <pstats.Stats instance at 0x017C9030>

    Caveat: don't optimize prematurely

    "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." -- Donald Knuth

    Common wisdom suggests that optimization should happen only once the code has reached a working, clear, near-final state. If you think about optimization too soon, you may do work that has to be undone later; or your optimizations may themselves be undone as you complete the functionality of your code.

    Note: some of these examples are taken from the "Mouse vs. Python" blog.


    Python Enhancement Packages

    These packages provide varying approaches toward writing and running more efficient Python code.


    •  PyPy: a "Just in Time" compiler for Python -- can speed up almost any Python code.
    •  Cython: superset of the Python language that additionally supports calling of C functions and declaring C types -- good for building Python modules in C.
    •  Pyrex: compiler that lets you combine Python code with C data types, compiling your code into a C extension for Python.
    •  Weave: allows embedding of C code within Python code.
    •  Shed Skin: an experimental module that can translate Python code into optimized C++.

    While PyPy is a no-brainer for speeding up code, the other libraries listed here require a knowledge of C. The deepest analysis of Python code incorporates efficient C code and/or takes into account the underlying C implementation: the reference interpreter, CPython, is written in C, and the operations we invoke in Python translate to actions taken by that compiled C code. The most advanced Python developers have a working knowledge of C and study the C structures that Python employs.




    Functional Programming

    Functional Programming: Overview

    We may say that there are three commonly used styles (sometimes called paradigms) of program design: imperative/procedural, object-oriented and functional. Here we'll tackle a common problem (summing a sequence of integers, i.e. an arithmetic series) using each style. imperative or procedural uses a series of statements along with variables that change as a result. We call these variable values the program's state.


    mysum = 0
    for counter in range(11):
        mysum = mysum + counter
    
    print(mysum)

    object-oriented uses object state to produce outcome.


    class Summer(object):
        def __init__(self):
            self.sum = 0
        def add(self, num):
            self.sum = self.sum + num
    
    s = Summer()
    for num in range(11):
        s.add(num)
    
    print(s.sum)

    functional combines pure functions to produce outcome. No state change is involved.

    print(sum(range(11)))

    A pure function is one that only handles input, output and its own variables -- it does not affect, nor is it affected by, global or other variables existing outside the function. Because of this "air-tightness", functional code can be tested more reliably than code in the other styles.

    Some languages are designed around a single style or paradigm, but since Python is a "multi-paradigm" language, it can be used to code in any of these styles. To employ functional programming in our own programs, we need only seek to replace imperative code with functional code, combining pure functions in ways that replicate the patterns we use to iterate, summarize, compute, etc. After some experience coding in this style, you may recognize patterns for iteration, accumulation, etc. and more readily employ them in your programs, making them more predictable, testable and less prone to error.

    Note: the Python documentation provides a solid overview of functional programming in Python.
    Mary Rose Cook provides a plain-language introduction to functional programming.
    O'Reilly publishes a free e-book with a comprehensive review of functional programming by Python luminary David Mertz.


    Review: lambdas

    Lambda functions are simply inline functions -- they can be defined entirely within a single statement, within a container initialization, etc.


    Lambdas are most often used inside functions like sorted():


    # sort a list of names by last name
    names = [ 'Josh Peschko', 'Gabriel Feghali', 'Billy Woods', 'Arthur Fischer-Zernin' ]
    sortednames = sorted(names, key=lambda name:  name.split()[1])
    
    # sort a list of CSV lines by the 2nd column in the file
    slines = sorted(lines, key=lambda x: x.split(',')[1])

    We will see lambdas used in other functions such as map(), filter() and reduce().


    Review: list comprehensions; set comprehensions and dict comprehensions

    List, set and dict comprehensions can filter or transform sequences in a single statement.


    Functional programming (and algorithms in general) often involves the processing of sequences. List comprehensions provide a flexible way to filter and modify values within a list.


    list comprehension: return a list

    nums = [1, 2, 3, 4, 5]
    dblnums = [ val * 2 for val in nums ]
    print(dblnums)                                 # [2, 4, 6, 8, 10]
    
    print([ val * 2 for val in nums if val > 2])   # [6, 8, 10]

    set comprehension: return a set

    states = { line.split(':')[3]
               for line in open('student_db.txt').readlines()[1:] }

    dict comprehension: return a dict

    student_states = { line.split(':')[0]: line.split(':')[3]
                       for line in open('student_db.txt').readlines()[1:] }
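    The same comprehension patterns work on in-memory data; a self-contained sketch with invented records in a name:age:state shape:

```python
records = ['ann:22:NY', 'bob:31:CA', 'cal:27:TX']

# dict comprehension: name -> state
states_by_name = { rec.split(':')[0]: rec.split(':')[2] for rec in records }
print(states_by_name)     # {'ann': 'NY', 'bob': 'CA', 'cal': 'TX'}

# set comprehension: unique states
states = { rec.split(':')[2] for rec in records }
print(sorted(states))     # ['CA', 'NY', 'TX']
```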

    map() and filter() as alternatives to list comprehensions

    Although list comprehensions have nominally replaced map() and filter(), these functions are still used in many functional programming algorithms.


    map(): apply a transformation function to each item in a sequence

    # square some integers
    sqrd = list(map(lambda x: x ** 2, range(6)))
    print(sqrd)    # [0, 1, 4, 9, 16, 25]
    
    # get string lengths
    lens = list(map(len, ['some', 'words', 'to', 'get', 'lengths', 'from']))
    print(lens)    # [4, 5, 2, 3, 7, 4]

    filter(): apply a filtering function to each item in a sequence

    pos = list(filter(lambda x: x > 0, [-5, 2, -3, 17, 6, 4, -9]))
    print(pos)     # [2, 17, 6, 4]

    reduce() for accumulation of values

    Like map() or filter(), reduce() applies a function to each item in a sequence, but accumulates a value as it iterates.


    It accumulates values through a second variable to its processing function. In the below examples, the accumulator is a and the current value of the iteration is x. a grows through the accumulation, as if the function were saying a = a + x or a = a * x.


    Here is our arithmetic series for integers 1-10, done with reduce():

    from functools import reduce
    def addthem(a, x):
        return a + x
    
    intsum = reduce(addthem, list(range(1, 11)))
    
    # same using a lambda
    intsum = reduce(lambda a, x: a + x, list(range(1, 11)))

    Just as easily, a factorial of integers 1-10:

    from functools import reduce
    facto = reduce(lambda a, x: a * x, list(range(1, 11)))       # 3628800

    default value


    Since reduce() has to start with a value in the accumulator, it will attempt to begin with the first element in the source list. However, if each value is being transformed before being accumulated, the first computation may result in an error:

    from functools import reduce
    strsum = reduce(lambda a, x: a + int(x), ['1', '2', '3', '4', '5'])
    
    # TypeError: can only concatenate str (not "int") to str

    This happens because, without an initial value, reduce() uses the first element of the sequence as the starting accumulator. Here that first element is the string '1', so the first call attempts '1' + int('2') -- adding an int to a str.


    In these cases we can supply an initial value to reduce(), so it knows where to begin:

    from functools import reduce
    strsum = reduce(lambda a, x: a + int(x), ['1', '2', '3', '4', '5'], 0)
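    Rather than writing lambdas for simple arithmetic, the stdlib operator module provides ready-made function versions of the operators, which pair naturally with reduce():

```python
from functools import reduce
import operator

nums = [1, 2, 3, 4, 5]

total = reduce(operator.add, nums)         # 15, same as sum(nums)
product = reduce(operator.mul, nums, 1)    # 120, with an explicit initial value
print(total, product)                      # 15 120
```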

    Higher-Order Functions Any function that takes a function as an argument, or that returns a function as a return value, is a higher-order function. map(), filter(), reduce() and sorted() all take functions as arguments. The decorators @property, @staticmethod and @classmethod each take a function as argument and return a modified function as a return value.


    any() and all(): return True based on truth of elements

    any(): return True if any elements are True

    any([1, 0, 2, 0, 3])        # True:  at least one item is True
    
    any([0, [], {}, ''])        # False: none of the items is True

    all(): return True if all elements are True

    all([1, 5, 0.0001, 1000])   # True:  all items are True
    
    all([1, 5, 9, 10, 0, 20])   # False:  one item is not True
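    any() and all() are typically fed generator expressions, which lets them short-circuit on the first deciding element:

```python
nums = [3, 7, 12, 5]

print(all(n > 0 for n in nums))          # True:  every element is positive
print(any(n % 2 == 0 for n in nums))     # True:  at least one element is even
print(any(n > 100 for n in nums))        # False: no element exceeds 100
```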

    generators

    Generators are iterators that can calculate and generate any number of items.


    Generators behave like iterators, except that they yield a value rather than return one, and they remember the state of their variables so that each subsequent next() call picks up where the last one left off (at the point of yield). As with any iterator, next() resumes the function to produce the next item, and StopIteration signals that the generator is exhausted. (next() is called automatically by constructs like the for loop.)


    Generators are particularly useful in producing a sequence of n values, i.e. not a fixed sequence, but an unlimited sequence. In this example we have prepared a generator that generates primes up to the specified limit.

    def get_primes(num_max):
        """ prime number generator """
        candidate = 2
        found = []
        while True:
            if all(candidate % prime != 0 for prime in found):
                yield candidate
                found.append(candidate)
            candidate += 1
            if candidate >= num_max:
                return    # a plain return ends the generator (raising StopIteration here is an error as of Python 3.7)
    
    my_iter = get_primes(100)
    print(next(my_iter))        # 2
    print(next(my_iter))        # 3
    print(next(my_iter))        # 5
    
    for i in get_primes(100):
        print(i)



    Generators and Recursion

    Generators and generator comprehensions

    A generator is like an iterator, but may generate an indefinite number of items.


    A generator is a special kind of object that returns a succession of items, one at a time. Unlike functions that create a list of results in memory and then return the entire list (like the range() function in Python 2), generators perform lazy fetching, using up only enough memory to produce one item, returning it, and then proceeding to the next item retrieval. For example, in Python 2 range() produced a list of integers:


    import sys; print sys.version      # 2.7.10
    
    x = range(10)
    print(x)                           # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

    But in Python 3, range() produces a special range() object that can be iterated over to obtain the list:


    import sys; print(sys.version)     # 3.7.0
    x = range(10)
    print(x)                           # range(0, 10)
    
    for el in x:
        print(el)        # 0
                         # 1
                         # 2 etc...

    It makes sense that range() should use lazy fetching, since most of the time using it we're only interested in iterating over it, one item at a time. (Strictly speaking range() is not a generator, but we can consider its behavior in that context when discussing lazy fetching.) If we do want a list of integers, we can simply pass the object to list():


    print(list(range(10)))       # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

    Since list() is an explicit call, this draws the reader's attention to the memory being allocated, in line with Python's philosophy of "explicit is better than implicit". Without the explicit call, the memory allocation might not be so clear.

    A generator comprehension is a list comprehension that uses lazy fetching to produce a generator object, rather than producing an entirely new list:


    convert_list = ['THIS', 'IS', 'QUITE', 'UPPER', 'CASE']
    
    lclist = [ x.lower() for x in convert_list ]   # list comprehension (square brackets)
    
    gclist = ( x.lower() for x in convert_list )   # generator comprehension (parentheses)
    
    
    print(lclist)         # ['this', 'is', 'quite', 'upper', 'case']
    
    print(gclist)         # <generator object <genexpr> at 0x10285e7d0>

    We can then iterate over the generator object to retrieve each item in turn. In Python 3, a list comprehension is just "syntactic sugar" for a generator comprehension wrapped in list():


    lclist = list(( x.lower() for x in convert_list ))
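    Because most aggregating functions accept any iterable, a generator comprehension can be passed to them directly, and no intermediate list is ever built:

```python
convert_list = ['THIS', 'IS', 'QUITE', 'UPPER', 'CASE']

# total character count, computed one item at a time
total_chars = sum(len(word) for word in convert_list)
print(total_chars)       # 20

# the lowercased words are produced lazily as max() consumes them
print(max(word.lower() for word in convert_list))    # upper
```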

    Generator Functions

    We can create our own generator functions, which might be necessary if we don't want our list-returning function to produce the entire list in memory.


    Writing your own generator function would be useful, or even needed, if: 1) we are designing a list-producing function; 2) the items are coming from a generator-like source (for example, calculating prime numbers, or looping through a file and modifying each line); or 3) the list coming back from the function would be too big to be conveniently held in memory (or too big for memory altogether). The generator function contains a new statement, yield, which returns an item produced by the function but remembers its place in the list-generating process. Here is a simplest possible generator, containing 3 yield statements:


    def return_val():
        yield 'hello'
        yield 'world'
        yield 'wassup'
    
    for msg in return_val():
        print(msg, end=' ')      # hello world wassup
    
    x = return_val()
    print(x)                      # <generator object return_val at 0x10285e7d0>

    As with range() or a generator comprehension, a generator function produces an object that performs lazy fetching. Consider this simulation of the range() function, which generates a sequence of integers starting at 0:


    def my_range(max):
        x = 0
        while x < max:
            yield x
            x += 1
    
    xr = my_range(5)
    
    print(xr)                    # <generator object my_range at 0x10285e870>
    
    for val in my_range(5):
        print(val)               # 0 1 2 3 4
    
    print(list(my_range(5)))     # [0, 1, 2, 3, 4]
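
    Driving a generator by hand shows the lazy protocol at work: each next() call resumes the function at the yield, and exhaustion raises StopIteration:

```python
def my_range(max):      # same generator as above
    x = 0
    while x < max:
        yield x
        x += 1

xr = my_range(2)
print(next(xr))          # 0
print(next(xr))          # 1

try:
    next(xr)             # the generator is exhausted
except StopIteration:
    print('done')        # done
```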

    Generators are particularly useful in producing a sequence of n values, i.e. not a fixed sequence, but an unlimited sequence. In this example we have prepared a generator that generates primes up to the specified limit.


    def get_primes(num_max):
        """ prime number generator """
        candidate = 2
        found = []
        while True:
            if all(candidate % prime != 0 for prime in found):
                yield candidate
                found.append(candidate)
            candidate += 1
            if candidate >= num_max:
                return    # a plain return ends the generator (raising StopIteration here is an error as of Python 3.7)
    
    my_iter = get_primes(100)
    print(next(my_iter))        # 2
    print(next(my_iter))        # 3
    print(next(my_iter))        # 5
    
    for i in get_primes(100):
        print(i)

    Recursive functions

    A recursive function calls itself until a condition has been reached.


    Recursive functions are appropriate for processes that iterate over a structure of an unknown "depth" of items or events. A factorial is the product of a range of numbers (1 * 2 * 3 * 4 ...).


    factorial: "linear" approach

    def factorial_linear(n):
        prod = 1
        for i in range(1, n+1):
            prod = prod * i
        return prod

    factorial: "recursive" approach

    def factorial(n):
        if n < 1:                              # base case (reached 0):  returns
            return 1
        else:
            return_num = n * factorial(n - 1)  # recursive call
            return return_num
    
    print(factorial(5))       # 120
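    Tracing the calls makes the recursion visible; a sketch of the same factorial with prints added (the depth parameter exists only to indent the trace):

```python
def factorial(n, depth=0):
    indent = '  ' * depth
    print(indent + 'factorial({}) called'.format(n))
    if n < 1:                                    # base case
        return 1
    result = n * factorial(n - 1, depth + 1)     # recursive call
    print(indent + 'factorial({}) returns {}'.format(n, result))
    return result

print(factorial(3))      # 6
```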

    Recursive functions are appropriate for processes that iterate over a structure with an unknown "depth" of items or events. Such situations include files within a directory tree, where listing the directory is performed over and over until all directories within the tree are exhausted; or similarly, visiting links to pages within a website, where listing the links on a page is performed repeatedly. Recursion features three elements: a recursive call, which is a call by the function to itself; the function process itself; and a base condition, the point at which the chain of recursions finally returns. A directory tree is a recursive structure in that it requires the same operation (listing files in a directory) to be applied to "nodes" of unknown depth:


    Recurse through a directory tree

    import os
    
    def list_dir(this_dir):
        print('* entering list_dir {} *'.format(this_dir))
        for name in os.listdir(this_dir):
            pathname = os.path.join(this_dir, name)
            if os.path.isdir(pathname):
                list_dir(pathname)
            else:
                print('  ' + name)
        print('* leaving list_dir *')   # base condition:  looping is complete
    
    list_dir('/Users/david/test')

    * entering list_dir /Users/david/test *
      recurse.py
    * entering list_dir /Users/david/test/test1 *
      file1
      file2
    * entering list_dir /Users/david/test/test2 *
      file3
      file4
    * leaving list_dir *
    * entering list_dir /Users/david/test/test3 *
      file5
      file6
    * leaving list_dir *
    * leaving list_dir *
    * entering list_dir /Users/david/test/test4 *
      file7
      file8
    * leaving list_dir *
    * leaving list_dir *

    The function process is the listing of the items in a directory and the printing of the files. The recursive call is the call to list_dir(pathname) inside the loop -- made whenever the directory listing encounters a subdirectory. The base condition occurs when the file listing is completed: there are no more directories to loop through, so the function call returns.
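    For real-world directory traversal, the stdlib's os.walk() performs this recursion for us; a sketch that mirrors list_dir() (the starting path is illustrative):

```python
import os

def list_dir_walk(top):
    # os.walk() yields a (dirpath, dirnames, filenames) tuple for every
    # directory under top, handling the recursion internally
    for dirpath, dirnames, filenames in os.walk(top):
        print('* in {} *'.format(dirpath))
        for name in filenames:
            print('  ' + name)

list_dir_walk('.')    # illustrative: lists everything under the current directory
```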




    Algorithmic Complexity Analysis and "Big O"

    Introduction: the Coding Interview

    Coding interviews follow a consistent pattern of evaluation and success criteria.


    What interviewers are considering:

    •  Analytical Skills: how easily, how well and how efficiently did you solve a coding challenge?
    •  Coding Skills: how clear and well organized was your code, did you use proper style, and did you consider potential errors?
    •  Technical knowledge / computer science fundamentals: how familiar are you with the technologies relevant to the position?
    •  Experience: have you built interesting projects or solved interesting problems, and have you demonstrated passion for what you are doing?
    •  Culture Fit: can you tell a joke, and can you take one? Seriously, does your personality fit in with the office or team culture?

    The interview process:

    •  The phone screen: 1-2 calls focusing first on your personality and cultural fit, and then on your technical skills. Some phone screens include a coding interview.
    •  The take-home exam: a coding problem that may or may not be timed. Your code may be evaluated for a number of factors: good organization and style, effective solution, efficient algorithm.
    •  The in-person interview: 1 or more onsite interviews with engineers, team lead and/or manager. If the office is out of town you may even fly there (at the company's expense). Many onsite interviews are full-day, in which several stakeholders interview you in succession.
    •  The whiteboard coding interview: for various reasons most companies prefer that you write out code on a whiteboard. You should practice coding challenges on a whiteboard if only to get comfortable with the pen. Writing skills are important, particularly when writing a "pretty" (i.e., without brackets) language like Python.


    Introduction: Algorithmic Complexity Analysis

    Algorithms can be analyzed for efficiency based on how they respond to varying amounts of input data.


    Algorithm: a block of code designed for a particular purpose. You may have heard of a sort algorithm, a mapping or filtering algorithm, a computational algorithm; Google's vaunted search algorithm or Facebook's "feed" algorithm; all of these refer to the same concept -- a block of code designed for a particular purpose. Any block of code is an algorithm, including simple ones. Since algorithms can be well designed or poorly designed, time efficient or inefficient, memory efficient or inefficient, it becomes a meaningful discipline to analyze the efficiency of one approach over another. Some examples are taken from the premier text on interview questions and the coding interview process, Cracking the Coding Interview, by Gayle Laakmann McDowell. Several of the examples and information in this presentation can be found in a really clear textbook on the subject, Problem Solving with Algorithms and Data Structures, also available as a free PDF.


    The order ("growth rate") of a function

    The order describes the growth in steps of a function as the input size grows.


    A "step" can be seen as any individual statement, such as an assignment or a value comparison. Depending on its design, an algorithm may take take the same number of steps no matter how many elements are passed to input ("constant time"), an increase in steps that matches the increase in input elements ("linear growth"), or an increase that grows faster than the increase in input elements ("logarithmic", "linear logarithmic", "quadratic", etc.). Order is about growth of number of steps as input size grows, not absolute number of steps. Consider this simple file field summer. How many more steps for a file of 5 lines than a file of 10 lines (double the growth rate)? How many more for a file of 1000 lines?


    def sum_fieldnum(filename, fieldnum, delim):
        this_sum = 0.0
        fh = open(filename)
        for line in fh:
            items = line.split(delim)
            value = float(items[fieldnum])
            this_sum = this_sum + value
        fh.close()
        return this_sum

    Obviously several steps are being taken -- 5 "setup" steps that don't depend on the data size (the initial assignment, opening the filehandle, setting up the loop, closing the filehandle and returning the summed value) and 3 steps taken once for each line of the file (split the line, convert the item to float, add the float to the sum). Therefore, with varying input file sizes, we can calculate the steps:


       5 lines:  5 + (3 * 5),    or 5 + 15,   or 20 steps
      10 lines:  5 + (3 * 10),   or 5 + 30,   or 35 steps
    1000 lines:  5 + (3 * 1000), or 5 + 3000, or 3005 steps

    As you can see, the 5 "setup" steps become trivial as the input size grows -- it is 25% of the total with a 5-line file, but 0.0016% of the total with a 1000-line file, which means that we should consider only those steps that are affected by input size -- the rest are simply discarded from analysis.


    A simple algorithm: sum up a list of numbers

    Here's a simple problem that will help us understand the comparison of algorithmic approaches.


    It also happens to be an interview question I heard when I was shadowing an interview: Given a maximum value n, sum up all values from 0 to the maximum value. "range" approach:


    def sum_of_n_range(n):
        total = 0
        for i in range(1,n+1):
            total = total + i
        return total
    
    print(sum_of_n_range(10))

    "recursive" approach:


    def sum_of_n_recursive(total, count, this_max):
        total = total + count
        count += 1
        if count > this_max:
            return total
        return sum_of_n_recursive(total, count, this_max)
    
    print(sum_of_n_recursive(0, 0, 10))

    "formula" approach:


    def sum_of_n_formula(n):
        return (n * (n + 1)) // 2
    
    print(sum_of_n_formula(10))

    We can analyze the respective "order" of each of these functions by comparing its behavior when we pass it a large vs. a small value. We count each statement as a "step".

    The "range" solution begins with an assignment. It loops through each consecutive integer between 1 and the maximum value, for each integer performing a sum against the running total, then returns the final sum. So if we call sum_of_n_range() with 10, it performs the sum (total + i) 10 times. If we call it with 1,000,000, it performs the sum 1,000,000 times. The number of steps increases in a straight line with the number of values to sum. We call this linear growth.

    The "recursive" solution calls itself once for each value in the input. This also requires a step increase that follows the increase in values, so it is also "linear".

    The "formula" solution, on the other hand, arrives at the answer through a mathematical formula. It performs an addition, a multiplication and a division, but the computation is the same regardless of the input size. So whether the input is 10 or 1,000,000, the number of steps is the same. This is known as constant time.
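    We can confirm this analysis empirically with timeit: as n grows, the "range" version's time grows with it while the "formula" version's stays flat (actual timings will vary by machine):

```python
import timeit

def sum_of_n_range(n):
    total = 0
    for i in range(1, n + 1):
        total = total + i
    return total

def sum_of_n_formula(n):
    return (n * (n + 1)) // 2

# time each approach at two input sizes; number=100 repeats each call 100 times
for n in (1000, 10000):
    t_range = timeit.timeit(lambda: sum_of_n_range(n), number=100)
    t_formula = timeit.timeit(lambda: sum_of_n_formula(n), number=100)
    print(n, round(t_range, 5), round(t_formula, 5))
```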


    "Big O" notation

    The order of a function (the growth rate of the function as its input size grows) is expressed with a mathematical expression colloquially referred to as "Big O".


    Common function notations for Big O Here is a table of the most common growth rates, both in terms of their O notation and English names:

    "O" NotationName
    O(1)Constant
    O(log(n))Logarithmic
    O(n)Linear
    O(n * log(n))Log Linear
    O(n²)Quadratic
    O(n³)Cubic
    O(2^n) (2 to the power of n)Exponential

    Here's a graph of the constant, linear and exponential growth rates: Here's a graph of the other major scales. You can see that at this scale, "constant time" and "logarithmic" seem very close:

    Here is the wiki page for "Big O": https://en.wikipedia.org/wiki/Big_O_notation


    Constant time: O(1)

    A function that does not grow in steps or operations as the input size grows is said to be running at constant time.


    def sum_of_n_formula(n):
        return (n * (n + 1)) // 2
    
    print(sum_of_n_formula(10))

    That is, no matter how big n gets, the number of operations stays the same. "Constant time" growth is noted as O(1).


    Linear growth: O(n)

    The growth rate for the "range" solution to our earlier summing problem (repeated below) is known as linear growth.


    With linear growth, as the input size (or in this case, the integer value) grows, the number of steps or operations grows at the same rate:


    def sum_of_n_range(n):
        the_sum = 0
        for i in range(1,n+1):
            the_sum = the_sum + i
        return the_sum
    
    print(sum_of_n_range(10))

    Although there is another operation involved (the assignment of the_sum to 0), this additional step becomes trivial as the input size grows. We tend to ignore this step in our analysis because we are concerned with the function's growth, particularly as the input size becomes large. Linear growth is noted as O(n) where again, n is the input size -- growth of operations matching growth of input size.


    Logarithmic growth: O(log(n))

    A logarithm is an equation used in algebra. We can consider a log equation as the inverse of an exponential equation:


    b^c = a ("b to the power of c equals a")


    10³ = 1000   ## 10 cubed == 1000

    is considered equivalent to:

    log_b(a) = c ("log with base b and value a equals c")


    log_10(1000) = 3

    A logarithmic scale is a nonlinear scale used to represent a set of values with potentially huge differences between them -- some relatively small and some exponentially large. Such a scale is needed to represent all points on a graph without minimizing the importance of the small values. Common uses of a logarithmic scale include earthquake magnitude, sound loudness, light intensity, and pH of solutions. For example, the Richter Scale of earthquake magnitude grows in absolute intensity as it moves up the scale -- 5.0 is 10 times that of 4.0; 6.0 is 10 times that of 5.0; 7.0 is 10 times that of 6.0, etc. This is known as a base 10 logarithmic scale. In other words, a base 10 logarithmic scale runs as:


    1, 10, 100, 1000, 10000, 100000, 1000000

    Logarithms in Big O notation However, the O(log(n)) ("Oh log of n") notation refers to a base 2 progression -- each value is double the previous one. In other words, a base 2 logarithmic scale runs as:


    1, 2, 4, 8, 16, 32, 64

    A classic binary search algorithm on an ordered list of integers is O(log(n)). You may recognize this as the "guess a number from 1 to 100" algorithm from one of the extra credit assignments.


    def binary_search(alist, item):
        first = 0
        last = len(alist)-1
        found = False
    
        while first<=last and not found:
            midpoint = (first + last)//2
            if alist[midpoint] == item:
                found = True
            else:
                if item < alist[midpoint]:
                    last = midpoint-1
                else:
                    first = midpoint+1
    
        return found
    
    print(binary_search([1, 3, 4, 9, 11, 13], 11))
    
    print(binary_search([1, 2, 4, 9, 11, 13], 6))

    The assumption is that the search list is sorted. Note that once the algorithm decides whether the search integer is higher or lower than the current midpoint, it "discards" the other half and repeats the binary search on the remaining values. Since each iteration halves the remaining values (n/2, n/4, n/8, ...), the number of loops grows logarithmically. Hence O(log(n)).
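    We can confirm the logarithmic step count by instrumenting the halving loop -- a sketch, where binary_search_steps is my own helper that always takes the "upper half" branch (a worst-case, item-absent search):

```python
import math

def binary_search_steps(n):
    # count loop passes for a worst-case (absent) item by always
    # discarding the lower half, exactly as binary_search narrows its range
    first, last, steps = 0, n - 1, 0
    while first <= last:
        steps += 1
        midpoint = (first + last) // 2
        first = midpoint + 1
    return steps

for n in (1000, 1_000_000):
    # steps stays close to log2(n) even as n grows a thousandfold
    print(n, binary_search_steps(n), math.ceil(math.log2(n)))
```

    A list 1,000 times longer needs only about 10 more passes, which is the essence of O(log(n)).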


    Linear logarithmic growth: O(n * log(n))

    The basic approach of a merge sort: halve the list, recursively sort each half, then merge the sorted halves back together.


    def merge_sort(a_list):
        print(("Splitting ", a_list))
        if len(a_list) > 1:
            mid = len(a_list) // 2      # (floor division, so lop off any remainder)
            left_half = a_list[:mid]
            right_half = a_list[mid:]
    
            merge_sort(left_half)
            merge_sort(right_half)
    
            i = 0
            j = 0
            k = 0
    
            while i < len(left_half) and j < len(right_half):
                if left_half[i] < right_half[j]:
                    a_list[k] = left_half[i]
                    i = i + 1
                else:
                    a_list[k] = right_half[j]
                    j = j + 1
                k = k + 1
    
            while i < len(left_half):
                a_list[k] = left_half[i]
                i = i + 1
                k = k + 1

            while j < len(right_half):
                a_list[k] = right_half[j]
                j = j + 1
                k = k + 1
        print(("Merging ", a_list))
    
    a_list = [54, 26, 93, 17, 77, 31, 44, 55, 20]
    merge_sort(a_list)
    print(a_list)

    The output of the above can help us understand what portions of the unsorted list are being managed:


    ('Splitting ', [54, 26, 93, 17, 77, 31, 44, 55, 20])
    ('Splitting ', [54, 26, 93, 17])
    ('Splitting ', [54, 26])
    ('Splitting ', [54])
    ('Merging ', [54])
    ('Splitting ', [26])
    ('Merging ', [26])
    ('Merging ', [26, 54])
    ('Splitting ', [93, 17])
    ('Splitting ', [93])
    ('Merging ', [93])
    ('Splitting ', [17])
    ('Merging ', [17])
    ('Merging ', [17, 93])
    ('Merging ', [17, 26, 54, 93])
    ('Splitting ', [77, 31, 44, 55, 20])
    ('Splitting ', [77, 31])
    ('Splitting ', [77])
    ('Merging ', [77])
    ('Splitting ', [31])
    ('Merging ', [31])
    ('Merging ', [31, 77])
    ('Splitting ', [44, 55, 20])
    ('Splitting ', [44])
    ('Merging ', [44])
    ('Splitting ', [55, 20])
    ('Splitting ', [55])
    ('Merging ', [55])
    ('Splitting ', [20])
    ('Merging ', [20])
    ('Merging ', [20, 55])
    ('Merging ', [20, 44, 55])
    ('Merging ', [20, 31, 44, 55, 77])
    ('Merging ', [17, 20, 26, 31, 44, 54, 55, 77, 93])
    [17, 20, 26, 31, 44, 54, 55, 77, 93]

    Here's an interesting description comparing O(log(n)) to O(n * log(n)):


    log(n) is proportional to the number of digits in n.
    
    n * log(n) is n times greater.
    
    Try writing the number 1000 once versus writing it one thousand times.
    The first takes O(log(n)) time, the second takes O(n * log(n)) time.
    
    Now try that again with 6700000000. Writing it once is still trivial.
    Now try writing it 6.7 billion times.
    We'll check back in a few years to see your progress.

    Quadratic growth: O(n²)

    O(n²) growth can best be described as "for each element in the sequence, loop through the sequence". This is why it's notated as O(n²).


    def all_combinations(the_list):
        results = []
        for item in the_list:
            for inner_item in the_list:
                results.append((item, inner_item))
        return results

    print(all_combinations(['a', 'b', 'c', 'd', 'e', 'f', 'g']))

    With 7 elements we're seeing n * n, so 49 individual tuple appends.


    [('a', 'a'), ('a', 'b'), ('a', 'c'), ('a', 'd'), ('a', 'e'), ('a', 'f'),
     ('a', 'g'), ('b', 'a'), ('b', 'b'), ('b', 'c'), ('b', 'd'), ('b', 'e'),
     ('b', 'f'), ('b', 'g'), ('c', 'a'), ('c', 'b'), ('c', 'c'), ('c', 'd'),
     ('c', 'e'), ('c', 'f'), ('c', 'g'), ('d', 'a'), ('d', 'b'), ('d', 'c'),
     ('d', 'd'), ('d', 'e'), ('d', 'f'), ('d', 'g'), ('e', 'a'), ('e', 'b'),
     ('e', 'c'), ('e', 'd'), ('e', 'e'), ('e', 'f'), ('e', 'g'), ('f', 'a'),
     ('f', 'b'), ('f', 'c'), ('f', 'd'), ('f', 'e'), ('f', 'f'), ('f', 'g'),
     ('g', 'a'), ('g', 'b'), ('g', 'c'), ('g', 'd'), ('g', 'e'), ('g', 'f'),
     ('g', 'g')]

    Exponential Growth: O(2^n)

    Exponential denotes an algorithm whose growth doubles with each addition to the input data set.


    One example would be the recursive calculation of a Fibonacci series


    def fibonacci(num):
        if num <= 1:
            return num
        return fibonacci(num - 2) + fibonacci(num - 1)
    
    for i in range(10):
        print(fibonacci(i), end=' ')
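    To see the doubling, we can count the calls the recursive version makes -- a sketch; fib_count is my own helper mirroring the structure of fibonacci() above:

```python
def fib_count(num):
    # returns (fibonacci(num), number of calls made), following the same
    # recursive structure as the fibonacci() function above
    if num <= 1:
        return num, 1
    v1, c1 = fib_count(num - 2)
    v2, c2 = fib_count(num - 1)
    return v1 + v2, c1 + c2 + 1

for n in (10, 15, 20):
    value, calls = fib_count(n)
    print(n, value, calls)
```

    Adding just 5 to n multiplies the call count by roughly 11, so each +1 roughly doubles the work.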

    Best Case, Expected Case, Worst Case

    Case analysis considers the outcome if data is ordered conveniently or inconveniently.


    For example, given a test item (an integer), search through a list of integers to see if that item's value is in the list. Sequential search (unsorted):


    def sequential_search(a_list, item):
        found = False
        for test_item in a_list:
            if test_item == item:
                found = True
                break
        return found
    
    test_list = [1, 2, 32, 8, 17, 19, 42, 13, 0]
    print(sequential_search(test_list, 2))          # best case:  found near start
    print(sequential_search(test_list, 17))         # expected case:  found near middle
    print(sequential_search(test_list, 999))        # worst case:  not found

    Analysis: O(n)

    Because the order of this function is linear, or O(n), case analysis is not meaningful. Whether the best or worst case, the rate of growth is the same. It is true that the "best case" results in very few steps taken (closer to O(1)), but that's not helpful in understanding the function.

    When case matters

    Case analysis comes into play when we consider that an algorithm may seem to do well with one dataset (best case), not as well with another dataset (expected case), and poorly with a third dataset (worst case). A quicksort picks a pivot value, divides the unsorted list at that pivot, and sorts each sublist by selecting another pivot and dividing again.


    def quick_sort(alist):
        """ initial start """

        quick_sort_helper(alist, 0, len(alist) - 1)


    def quick_sort_helper(alist, first_idx, last_idx):
        """ calls partition() and retrieves a split point,
            then calls itself with '1st half' / '2nd half' indices """

        if first_idx < last_idx:
            splitpoint = partition(alist, first_idx, last_idx)

            quick_sort_helper(alist, first_idx, splitpoint - 1)
            quick_sort_helper(alist, splitpoint + 1, last_idx)


    def partition(alist, first, last):
        """ main event: sort items to either side of a pivot value """

        pivotvalue = alist[first]   # very first item in the list is "pivot value"

        leftmark = first + 1
        rightmark = last

        done = False
        while not done:

            while leftmark <= rightmark and alist[leftmark] <= pivotvalue:
                leftmark = leftmark + 1

            while alist[rightmark] >= pivotvalue and rightmark >= leftmark:
                rightmark = rightmark - 1

            if rightmark < leftmark:
                done = True
            else:
                # swap two items
                alist[leftmark], alist[rightmark] = alist[rightmark], alist[leftmark]

        # swap the pivot into its final position
        alist[first], alist[rightmark] = alist[rightmark], alist[first]

        return rightmark


    alist = [54, 26, 93, 17, 77]
    quick_sort(alist)
    print(alist)
    

    Best case: the pivot repeatedly lands near the middle, splitting the list evenly -- O(n * log(n))

    Worst case: the pivot is always the biggest (or smallest) element in the list, as with already-sorted input -- each partition peels off just one item at a time (O(n²))

    Average case: the pivot is more or less in the middle -- O(n * log(n))


    "order" analysis example

    Let's take an arbitrary example to analyze. This algorithm works with the variable n -- we have not defined n because it represents the size of the input data, and our analysis asks: how does the time needed change as n grows? We can assume n is an integer (e.g., the length of an input sequence).


    a = 5                   # count these up:
    b = 6                   # 3 statements
    c = 10
    
    for k in range(n):
        w = a * k + 45      # 2 statements:
        v = b * b           # but how many times
                            # will they execute?
    
    for i in range(n):
        for j in range(n):
            x = i * i       # 3 statements:
            y = j * j       # how many times?
            z = i * j
    
    d = 33                  # 1 statement

    * We can count assignment statements that are executed once: there are 4 of these.
    * The 2 statements in the first loop are each executed once per iteration of the loop -- and it iterates n times. So we call this 2n.
    * The 3 statements in the second loop are executed n times * n times (a nested loop of range(n)) -- 3n², which we can simply call n² since constant multipliers are dropped.

    So the order equation can be expressed as 4 + 2n + n²

    eliminating the trivial factors

    However, remember that this analysis describes the growth rate of the algorithm as input size n grows very large. As n gets larger, the impact of 4 and of 2n becomes less and less significant compared to n²; eventually these elements become trivial. So we eliminate the lesser factors and pay attention only to the most significant -- and our final calculation is O(n²).
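    A quick sketch tabulating the total step count 4 + 2n + n² against n² alone shows how the quadratic term comes to dominate:

```python
def steps(n):
    # 4 one-time assignments + 2 statements run n times + the nested-loop term
    return 4 + 2 * n + n * n

for n in (10, 100, 1000):
    # the ratio of total steps to n*n alone approaches 1 as n grows
    print(n, steps(n), steps(n) / (n * n))
```

    At n = 1000 the constant and linear terms contribute well under 1% of the total.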


    Big O Analysis: rules of thumb

    Here are some practical ways of thinking, courtesy of The Idiot's Guide to Big O


    * Does it have to go through the entire list? There will be an n in there somewhere.
    * Does the algorithm's processing time increase at a slower rate than the size of the data set? Then there's probably a log(n) in there.
    * Are there nested loops? You're probably looking at n^2 or n^3.
    * Is access time constant regardless of the size of the dataset? O(1)


    More Big O recursion examples

    These were adapted from a stackoverflow question. Just for fun(!) these are presented without answers; answers on the next page.


    def recurse1(n):
        if n <= 0:
            return 1
        else:
            return 1 + recurse1(n-1)

    def recurse2(n):
        if n <= 0:
            return 1
        else:
            return 1 + recurse2(n-5)

    def recurse3(n):
        if n <= 0:
            return 1
        else:
            return 1 + recurse3(n // 2)

    def recurse4(n, m, o):
        if n <= 0:
            print('{}, {}'.format(m, o))
        else:
            recurse4(n-1, m+1, o)
            recurse4(n-1, m, o+1)
    

    def recurse5(n):
        for i in range(n)[::2]:       # count to n by 2's (0, 2, 4, 6, 8, etc.)
            pass
        if n <= 0:
            return 1
        else:
            return 1 + recurse5(n-5)

    More Big O recursion examples: analysis

    def recurse1(n):
        if n <= 0:
            return 1
        else:
            return 1 + recurse1(n-1)

    This function is being called recursively n times before reaching the base case so it is O(n) (linear)


    def recurse2(n):
        if n <= 0:
            return 1
        else:
            return 1 + recurse2(n-5)

    This function subtracts 5 from n before each recursive call, so it is called roughly n/5 times -- and n/5 is still O(n) (linear).


    def recurse3(n):
        if n <= 0:
            return 1
        else:
            return 1 + recurse3(n // 2)

    This function is O(log(n)), because we halve n before each recursive call.


    def recurse4(n, m, o):
        if n <= 0:
            print('{}, {}'.format(m, o))
        else:
            recurse4(n-1, m+1, o)
            recurse4(n-1, m, o+1)

    This function is O(2^n), or exponential, since each call makes two further calls of itself until the recursion depth reaches n.


    def recurse5(n):
        for i in range(n)[::2]:       # count to n by 2's (0, 2, 4, 6, 8, etc.)
            pass
        if n <= 0:
            return 1
        else:
            return 1 + recurse5(n-5)

    The for loop takes n/2 steps since we're counting by 2, and the recursion runs roughly n/5 times since each call subtracts 5. Because the loop runs on every recursive call, the total is about (n/5) * (n/2) = n²/10 steps -- so O(n²).


    Efficiency of core Python Data Structure Algorithms

    note: "k" is the list being added/concatenated/retrieved

    List

    Operation            Big-O Efficiency
    index []             O(1)
    index assignment     O(1)
    append               O(1)
    pop()                O(1)
    pop(i)               O(n)
    insert(i, item)      O(n)
    del operator         O(n)
    iteration            O(n)
    contains (in)        O(n)
    get slice [x:y]      O(k)
    del slice            O(n)
    set slice            O(n + k)
    reverse              O(n)
    concatenate          O(k)
    sort                 O(n * log(n))
    multiply             O(nk)



    Dict

    Operation            Big-O Efficiency (avg.)
    copy                 O(n)
    get item             O(1)
    set item             O(1)
    delete item          O(1)
    contains (in)        O(1)
    iteration            O(n)

    A rundown of each structure and the O time to complete operations on each structure is noted here: https://wiki.python.org/moin/TimeComplexity


    Common Algorithms Organized by Efficiency

    O(1) time

    1. Accessing an array index (int a = ARR[5])
    2. Inserting a node in a linked list
    3. Pushing and popping on a stack
    4. Insertion and removal from a queue
    5. Finding the parent or left/right child of a node in a tree stored in an array
    6. Jumping to the next/previous element in a doubly linked list

    and you can find a million more such examples...

    O(n) time

    1. Traversing an array
    2. Traversing a linked list
    3. Linear search
    4. Deletion of a specific element in a linked list (not sorted)
    5. Comparing two strings
    6. Checking for a palindrome
    7. Counting/bucket sort

    and here too you can find a million more such examples... In a nutshell, brute-force algorithms that must touch every element are based on O(n) time complexity.

    O(log(n)) time

    1. Binary search
    2. Finding the largest/smallest number in a binary search tree
    3. Certain divide-and-conquer algorithms based on linear functionality
    4. Calculating Fibonacci numbers (best method)

    The basic premise here is NOT using the complete data, and reducing the problem size with every iteration.

    O(n * log(n)) time

    1. Merge sort
    2. Heap sort
    3. Quick sort
    4. Certain divide-and-conquer algorithms based on optimizing O(n^2) algorithms

    The factor of log(n) is introduced by divide and conquer. Some of these algorithms are among the best optimized and are used frequently.

    O(n^2) time

    1. Bubble sort
    2. Insertion sort
    3. Selection sort
    4. Traversing a simple 2D array

    These are considered the less efficient algorithms when O(n * log(n)) counterparts exist; the general approach here may be brute force.




    Practice Problems (use up- and down-arrow to advance)

    Practice Problem: List sum (linear and recursion)

    Given a list of numbers, sum them up using a linear approach and using recursion. Answers appear on next slide.


    Practice Problem: List sum (linear and recursion) (Answers)

    Given a list of numbers, sum them up using a linear approach and using recursion. linear approach


    def list_sum_linear(num_list):
        the_sum = 0
        for i in num_list:
            the_sum = the_sum + i
        return the_sum

    print(list_sum_linear([1, 3, 5, 7, 9]))

    recursion approach


    def list_sum_recursive(num_list):
        if len(num_list) == 1:
            return num_list[0]
        else:
            return num_list[0] + list_sum_recursive(num_list[1:])

    print(list_sum_recursive([1, 3, 5, 7, 9]))

    Practice Problems: ETL Developer

    This was a question in an interview at AppNexus that I helped conduct, calling for a Python ETL developer (extract, transform, load) -- not a high-end position, but still one of value (and significant remuneration). Answers appear on next slide.

    Class and STDOUT data stream


    import sys

    class OT(object):
        def __init__(self, *thisfile):
            self.file = thisfile

        def write(self, obj):
            for f in self.file:
                f.write(obj)

    sys.stdout = OT(sys.stdout, open('myfile.txt', 'w'))

    1. What does this code do? Feel free to talk it through
    2. What is the 'object' in the parentheses?
    3. What does the asterisk in *thisfile mean?

    local and global namespace


    var = 10
    
    def myfunc():
      var = 20
      print(var)
    
    myfunc()
    print(var)

    1. What will this code output? Why?

    "sort" functions and multidimensional structures


    def myfunc(arg):
        return arg
    
    struct = [ { 'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9] }, { 'a': [10, 12, 13], 'b': [1, 2, 3], 'c': [1, 2, 3] },  { 'a': [1, 2, 3], 'b': [1, 2, 3], 'c': [1, 2, 3] } ]
    
    dd = sorted(struct, key=myfunc)

    1. What type of object is arg?
    2. Rewrite the 'myfunc' function so the dicts are sorted by the sum of values associated with 'c'.
    3. Convert your 'myfunc' function to a lambda
    4. Loop through struct and print out just the last value of each list.

    import statements

    1. Which of these import statements do you favor and why?


    import datetime
    
    import datetime as dt
    
    from datetime import *

    Practice Problems: ETL Developer (Answers)

    Class and STDOUT data stream


    import sys

    class OT(object):
        def __init__(self, *thisfile):
            self.file = thisfile

        def write(self, obj):
            for f in self.file:
                f.write(obj)

    sys.stdout = OT(sys.stdout, open('myfile.txt', 'w'))

    1. What does this code do? Feel free to talk it through.

    The class creates an object that stores multiple open data streams (in this case, sys.stdout and an open filehandle) in an attribute of the instance. When the write() method is called on the object, the class writes to each of the streams initialized in the instance -- here, to sys.stdout and to the open file. The OT instance is then assigned to sys.stdout, so any call to sys.stdout.write() passes to the instance. In addition, the print() function also calls sys.stdout.write(). The effect is that any print() calls that occur afterward write both to STDOUT and to the filehandle opened when the instance was constructed.

    2. What is the 'object' in the parentheses?

    It causes the OT class to inherit from object, making OT a new-style class. (In Python 3 every class inherits from object automatically, so this is optional.)

    3. What does the asterisk in *thisfile mean?

    It allows any number of arguments to be passed to the constructor / to __init__; they are collected into a tuple.

    local and global namespace


    var = 10
    
    def myfunc():
      var = 20
      print(var)
    
    myfunc()
    print(var)

    1. What will this code output? Why?

    20
    10

    Inside myfunc() the local variable var is set to 20 and printed. Once we have returned from the function, the global var is "revealed" (i.e., it is again accessible under the name var).

    "sort" functions and multidimensional structures


    def myfunc(arg):
        return arg
    
    struct = [ { 'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9] }, { 'a': [10, 12, 13], 'b': [1, 2, 3], 'c': [1, 2, 3] },  { 'a': [1, 2, 3], 'b': [1, 2, 3], 'c': [1, 2, 3] } ]
    
    dd = sorted(struct, key=myfunc)

    1. What type of object is arg?

    arg is one element of struct -- i.e., a dict. (Note that with the identity key shown above, sorted() would try to compare the dicts themselves, which raises a TypeError in Python 3.)

    2. Rewrite the 'myfunc' function so the dicts are sorted by the sum of values associated with 'c'.

    def myfunc(arg):
        return sum(arg['c'])

    3. Convert your 'myfunc' function to a lambda

    lambda arg: sum(arg['c'])

    4. Loop through struct and print out just the last value of each list.

    for this_dict in struct:
        for this_list in this_dict.values():
            print(this_list[-1])

    import statements

    1. Which of these import statements do you favor and why?


    import datetime
    
    import datetime as dt
    
    from datetime import *

    The answer should see the candidate disavowing any use of the last form, which imports all symbols from the datetime module into the global namespace and thus risks collisions with other modules.


    Practice Problems: Calculate a Factorial using linear approach and recursion

    A factorial is the product of each integer in a consecutive range starting at 1. So 4 factorial is 1 * 2 * 3 * 4 = 24. In the recursive approach, the function's job is very simply to multiply the number passed to it by the product produced by another call to the function with a number one less than the one passed to it. The function thus continues to call itself with one less integer until the argument reaches 0, at which point it returns 1. As each recursively called function returns, it passes back the value it was passed multiplied by the product returned to it. So on the way back the values are multiplied. Answers appear on next slide.


    Practice Problems: Calculate a Factorial using linear approach and recursion (Answers)

    factorial: "linear" approach


    def factorial_linear(n):
        prod = 1
        for i in range(1, n+1):
            prod = prod * i
        return prod

    factorial: "recursion" approach


    def factorial_recursive(n):
        if n < 1:
            return 1
        else:
            return_number = n * factorial_recursive(n-1)  # recursive call
            print('{}! = {}'.format(n, return_number))
            return return_number

    Practice Problems: Calculate a fibonacci series using linear approach and recursion

    A fibonacci series is one in which each number is the sum of the two previous numbers. Answers appear on next slide.


    Practice Problems: Calculate a fibonacci series using linear approach and recursion (Answers)

    A fibonacci series is one in which each number is the sum of the two previous numbers. linear approach


    def fib_lin(max):
    
        prev = 0
        curr = 1
    
        while curr < max:
            print(curr, end=' ')
            newcurr = prev + curr
            prev = curr
            curr = newcurr
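    The recursion approach isn't shown above; a minimal sketch (fib_rec is my own name), printing the same series up to a max of 50 as fib_lin(50) would:

```python
def fib_rec(n):
    # nth Fibonacci number; fib_rec(1) == fib_rec(2) == 1
    if n <= 2:
        return 1
    return fib_rec(n - 1) + fib_rec(n - 2)

# print the series up to a maximum value of 50
n = 1
while fib_rec(n) < 50:
    print(fib_rec(n), end=' ')
    n = n + 1
```

    Note that this recomputes each value from scratch, so it is O(2^n) -- the linear version is far more efficient.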

    Practice Problems: Calculate prime numbers to a max value

    Answers appear on next slide.


    Practice Problems: Calculate prime numbers to a max value (Answer)

    def get_primes(maxval):
        startval = 2
    
        while startval <= maxval:
            counter = 2
    
            while counter < startval:
                if startval % counter == 0:
                    startval = startval + 1
                    counter = 2
                    continue
                else:
                    counter = counter + 1
                    continue
    
            print(startval)
    
            startval = startval + 1
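    An alternative technique (not shown in the answer above) is the Sieve of Eratosthenes, which finds all primes up to a maximum much more efficiently -- a sketch; primes_sieve is my own name:

```python
def primes_sieve(maxval):
    # start by assuming every number >= 2 is prime
    is_prime = [True] * (maxval + 1)
    is_prime[0:2] = [False, False]
    # cross off multiples of each prime up to sqrt(maxval)
    for p in range(2, int(maxval ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, maxval + 1, p):
                is_prime[multiple] = False
    return [n for n in range(2, maxval + 1) if is_prime[n]]

print(primes_sieve(30))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```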

    Practice Problems: reverse a sequence

    this_list = ['a', 'b', 'c', 'd', 'e']

    Answers appear on next slide.


    Practice Problems: reverse a sequence (Answer)

    this_list = ['a', 'b', 'c', 'd', 'e']
    
    
    # using reversed
    print(list(reversed(this_list)))
    
    # using sorted
    print(sorted(this_list, reverse=True))
    
    # using negative stride
    print(this_list[::-1])
    
    # using list.insert()
    newlist = []
    for el in this_list:
        newlist.insert(0 ,el)
    
    # using indices
    newlist = []
    index = len(this_list)-1
    while index >= 0:
        newlist.append(this_list[index])
        index = index - 1

    Practice Problems: given two strings, detect whether one is an anagram of the other

    Answers appear on next slide.


    Practice Problems: given two strings, detect whether one is an anagram of the other (Answer)

    def is_anagram(test1, test2):

        test2 = list(test2)
        for char in test1:
            try:
                test2.remove(char)
            except ValueError:
                return False
        if test2:
            return False
        return True

    str1 = 'allabean'
    str2 = 'beallana'

    print(is_anagram(str1, str2))

    print(is_anagram('allabean', 'beallaaa'))
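    A common alternative detects anagrams by comparing character counts -- a sketch; is_anagram_counter is my own name:

```python
from collections import Counter

def is_anagram_counter(s1, s2):
    # two strings are anagrams iff their character counts match
    return Counter(s1) == Counter(s2)

print(is_anagram_counter('allabean', 'beallana'))   # True
print(is_anagram_counter('allabean', 'beallaaa'))   # False
```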
    

    Practice Problems: given a string, detect whether it is a palindrome

    Answers appear on next slide.


    Practice Problems: given a string, detect whether it is a palindrome (Answer)

    test_string = 'Able was I ere I saw Elba'
    
    if test_string.lower() == test_string.lower()[::-1]:
        print('"{}" is a palindrome'.format(test_string))

    Python "gotcha" questions

    These really are unfair, and not necessarily a good barometer -- they simply require that you know the quirks that lead to the strange output. But they can point out interesting aspects of the language. Answers appear on next slide.

    For each of the following blocks of code, what is the output?


    def extendList(val, list=[]):
        list.append(val)
        return list
    
    list1 = extendList(10)
    list2 = extendList(123,[])
    list3 = extendList('a')
    
    print("list1 = %s" % list1)
    print("list2 = %s" % list2)
    print("list3 = %s" % list3)

    def multipliers():
        return [lambda x : i * x for i in range(4)]
    
    print([m(2) for m in multipliers()])

    class Parent(object):
        x = 1
    
    class Child1(Parent):
        pass
    
    class Child2(Parent):
        pass
    
    print(Parent.x, Child1.x, Child2.x)
    Child1.x = 2
    print(Parent.x, Child1.x, Child2.x)
    Parent.x = 3
    print(Parent.x, Child1.x, Child2.x)

    def div1(x,y):
        print("%s/%s = %s" % (x, y, x/y))
    
    def div2(x,y):
        print("%s//%s = %s" % (x, y, x//y))
    
    div1(5,2)
    div1(5.,2)
    div2(5,2)
    div2(5.,2.)

    1. list = [ [ ] ] * 5
    2. list  # output?
    3. list[0].append(10)
    4. list  # output?
    5. list[1].append(20)
    6. list  # output?
    7. list.append(30)
    8. list  # output?

    Python "gotcha" questions (Answers)

    for each of the following blocks of code, what is the output?


    def extendList(val, list=[]):
        list.append(val)
        return list
    
    list1 = extendList(10)
    list2 = extendList(123,[])
    list3 = extendList('a')
    
    print("list1 = %s" % list1)
    print("list2 = %s" % list2)
    print("list3 = %s" % list3)
    
    # [10, 'a']
    # [123]
    # [10, 'a']

    The default list is constructed once, at the time the function is defined -- so calls that don't supply their own list all share (and mutate) that same list.
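    The standard fix (a sketch; extend_list is my own spelling) uses None as the default and builds a fresh list inside the function:

```python
def extend_list(val, lst=None):
    # a fresh list is created on each call unless the caller supplies one
    if lst is None:
        lst = []
    lst.append(val)
    return lst

print(extend_list(10))    # [10]
print(extend_list('a'))   # ['a']
```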


    def multipliers():
        return [lambda x : i * x for i in range(4)]
    
    print([m(2) for m in multipliers()])
    
    
    # [6, 6, 6, 6]

    Python closures are late binding: when we finally call each lambda, it looks up the current value of i, which by then is 3 (the last value from range(4)) -- so every lambda computes 3 * 2 = 6.
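    The usual fix binds i at lambda-creation time via a default argument -- a sketch:

```python
def multipliers():
    # i=i evaluates i when each lambda is created, storing that value
    # as the lambda's default argument instead of closing over the name
    return [lambda x, i=i: i * x for i in range(4)]

print([m(2) for m in multipliers()])   # [0, 2, 4, 6]
```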


    class Parent(object):
        x = 1
    
    class Child1(Parent):
        pass
    
    class Child2(Parent):
        pass
    
    print(Parent.x, Child1.x, Child2.x)
    Child1.x = 2
    print(Parent.x, Child1.x, Child2.x)
    Parent.x = 3
    print(Parent.x, Child1.x, Child2.x)
    
    ## 1 1 1
    ## 1 2 1
    ## 3 2 3

    Attribute lookup on a class starts in that class, then checks its parent class(es). Assigning Child1.x = 2 creates an attribute on Child1 that shadows Parent.x; Child2 never gets its own x, so it continues to find Parent.x.


    def div1(x,y):
        print("%s/%s = %s" % (x, y, x/y))
    
    def div2(x,y):
        print("%s//%s = %s" % (x, y, x//y))
    
    div1(5,2)
    div1(5.,2)
    div2(5,2)
    div2(5.,2.)
    
    
    ## 2.5
    ## 2.5
    ## 2
    ## 2.0

    In Python 3, the / operator always performs "true" division, returning a float even with integer operands (so div1(5, 2) prints 2.5). The // operator performs "floor division" (an integerized result), returning an int with int operands and a float when either operand is a float (hence 2.0).


    1. list = [ [ ] ] * 5
    2. list  # output?
    3. list[0].append(10)
    4. list  # output?
    5. list[1].append(20)
    6. list  # output?
    7. list.append(30)
    8. list  # output?
    
    
    ## [[], [], [], [], []]
    ## [[10], [10], [10], [10], [10]]
    ## [[10, 20], [10, 20], [10, 20], [10, 20], [10, 20]]
    ## [[10, 20], [10, 20], [10, 20], [10, 20], [10, 20], 30]
    

    key: in a * multiplication, Python simply duplicates the reference to the inner list rather than copying the list itself -- all five slots point at the same list
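    The usual fix (a sketch) is a list comprehension, which creates a distinct inner list on each pass:

```python
# five distinct empty lists, not five references to one list
lst = [[] for _ in range(5)]

lst[0].append(10)
print(lst)   # [[10], [], [], [], []]
```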