Advanced Python
Project Discussion, Session 2

Please note: notations (as used in the Introduction to Python course) are not required.

IMPORTANT note on using the sqlite client: the sqlite client is convenient for viewing table structure and other tasks, but it is not required. When using it, please keep in mind that your Terminal / Command Prompt session must be in the same directory as your database file, or you will create a new, empty file with the same name as the one you attempted to open. Please see the Challenge Exercises for the steps needed to move your command line session to the proper data folder for these projects.

2.1 SQL to CSV: write function sql_to_csv(db_filename, table, csv_filename) that reads from a table in an sqlite database, and writes to a CSV file.

Use the cursor .description attribute to retrieve the column names and build them into the first row of the CSV file. The result should be a .csv file with the same column heads and data as the database table. We will use session_2.db as the database, read from a table named revenue, and write to revenue_from_db.csv. Please make sure to close the filehandle as well as the database connection.

def sql_to_csv(db_filename, table, csv_filename):
    connect to database filename, returning a connection object
    call .cursor() on the connection object, returning a cursor object
    build sql query using the table_name
    execute sql query on the cursor object

    build a list of headers from the .description attribute

    open csv file for writing
    pass the file to csv.writer

    write header row to csv (use writer.writerow())
    for row in cursor
        write row to csv using .writerow()

    close csv file
    close database connection

It shouldn't be necessary to build a list of lists for writing with .writerows(); you can simply write to the CSV as you iterate through the database rows. In fact, since each result set row is a tuple, and a tuple works in the same way as a list here, you can write each row to the csv file with .writerow().
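
To make the outline concrete, here is one possible way to flesh it out -- a sketch, not the only acceptable solution. It assumes the table name comes from trusted code, since it is interpolated directly into the query string:

import csv
import sqlite3

def sql_to_csv(db_filename, table, csv_filename):
    conn = sqlite3.connect(db_filename)
    cursor = conn.cursor()
    cursor.execute(f'SELECT * FROM {table}')      # table name assumed trusted

    # column names are the first element of each .description tuple
    headers = [ item[0] for item in cursor.description ]

    fh = open(csv_filename, 'w', newline='')
    writer = csv.writer(fh)
    writer.writerow(headers)                      # header row first
    for row in cursor:                            # each row is a tuple; .writerow() accepts it
        writer.writerow(row)

    fh.close()
    conn.close()

sql_to_csv('session_2.db', 'revenue', 'revenue_from_db.csv')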

 
2.2 CREATE TABLE / DROP TABLE: write a script that creates a new table called weather_newyork in the session_2.db database with the following columns: date (should be a TEXT column), mean_temp (INT), precip (FLOAT), events (TEXT).
connect to db
create cursor object
execute sql DROP TABLE IF EXISTS query
execute sql CREATE TABLE query
commit the changes to the database

print all tables in the database (this is for testing only -- see code below)
Note that this script won't do anything except remove and then create the table. You can verify that the CREATE TABLE statement has succeeded with the following code:
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = [ tup[0] for tup in cursor.fetchall() ]
print(f'tables:  {tables}')
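
Putting the steps together, a minimal sketch of the script (the column types follow the assignment text):

import sqlite3

conn = sqlite3.connect('session_2.db')
cursor = conn.cursor()

cursor.execute('DROP TABLE IF EXISTS weather_newyork')
cursor.execute('''CREATE TABLE weather_newyork
                  (date TEXT, mean_temp INT, precip FLOAT, events TEXT)''')
conn.commit()

# testing only:  list the tables now present in the database
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = [ tup[0] for tup in cursor.fetchall() ]
print(f'tables:  {tables}')

conn.close()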
 
2.3 CSV to SQL: Starting with the DROP TABLE and CREATE TABLE code from the previous solution, write a script that reads from weather_newyork.csv (using the csv module), selects the date, mean_temp, precip and events columns, and inserts them row-by-row into the table you created, weather_newyork.

Please note that the "precip" column has a "T" value instead of a numeric value in some rows. Please account for this by inserting the Python value None (the unquoted value None, NOT the string 'None') in place of the T. Of course this is the SQL to CSV assignment in reverse.

Start by copying in the DROP TABLE IF EXISTS and CREATE TABLE code from the prior assignment, and test it to see that it works (you'll be able to tell that it worked if the script runs without error; if there were any problem you'd see an exception). Next, compose your INSERT query string and include parameter arguments (i.e., with ? tokens in place of values). You'll execute this query over and over with different values inserted, so the query string should be assigned before the loop over the CSV data begins. Then open the CSV file using the csv module, next() over the header and loop through the CSV data rows, executing the INSERT query with each row's data. Don't forget to commit() at the end in order to see the rows written to the table.

connect to db
create cursor object
execute sql DROP TABLE IF EXISTS query
execute sql CREATE TABLE query

initialize an INSERT query (use ? parameter arguments)

open CSV file
pass filehandle to csv reader
next() over the header row

'for' loop over the reader
    assign the 4 desired column values to variables:
    date, temp, precip, events
    if the precip value is 'T', reassign as None (no quotes)
    execute the INSERT query with this row's data

commit to finalize the inserts

Finally, you'll want to issue a SELECT query and loop over the results to confirm that the rows have been inserted. Ideally, you would use the sqlite3 client to view the rows, since this gives independent confirmation. You must confirm that you have replaced T 'precip' values with None. The first row with a 'precip' value of T is on 1/9/16. At that row in the table, the 'precip' value should be empty or NULL. (In the sqlite3 client it will show as an empty cell; selecting using Python should show None.)
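
For reference, here is a sketch of the whole script. The CSV column positions used below (row[0] through row[3]) are placeholders -- confirm them against the actual header of weather_newyork.csv:

import csv
import sqlite3

conn = sqlite3.connect('session_2.db')
cursor = conn.cursor()

cursor.execute('DROP TABLE IF EXISTS weather_newyork')
cursor.execute('''CREATE TABLE weather_newyork
                  (date TEXT, mean_temp INT, precip FLOAT, events TEXT)''')

# the INSERT query is built once, before the loop; ? tokens stand in for values
insert_query = 'INSERT INTO weather_newyork VALUES (?, ?, ?, ?)'

fh = open('weather_newyork.csv', newline='')
reader = csv.reader(fh)
header = next(reader)                      # step over the header row

for row in reader:
    # illustrative column positions -- adjust to the real CSV layout
    date, mean_temp, precip, events = row[0], row[1], row[2], row[3]
    if precip == 'T':
        precip = None                      # the value None, not the string 'None'
    cursor.execute(insert_query, (date, mean_temp, precip, events))

conn.commit()                              # finalize the inserts
fh.close()

# confirm:  select a few rows back and spot-check the 'precip' values
cursor.execute('SELECT * FROM weather_newyork')
for row in cursor.fetchall()[:10]:
    print(row)

conn.close()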

 
2.4 JSON to SQL

Start with the same DROP TABLE / CREATE TABLE code you used in the previous assignment to drop and then create the weather_newyork table. Also as in the previous assignment, initialize an INSERT query with parameter arguments. Open weather_newyork_dod.json and use the json module to load the file as a Python dict of dicts. Loop through the dict of dicts (where the key is a string and the value is an "inner" dict of values), and insert each "inner" dict as a row in the database by executing the INSERT query with the "inner" dict's values. Use subscripts to grab each dict value for assembly into the table. Please note that the "precip" key sometimes has a "T" value instead of a numeric value. Please account for this by inserting the Python value None (the unquoted value None, NOT the string 'None') in place of the T. Don't forget to commit() the database connection at the end in order to see the rows written to the table.

connect to db
create cursor object
execute sql DROP TABLE IF EXISTS query
execute sql CREATE TABLE query

initialize an INSERT query (use ? parameter arguments)

open JSON file
load JSON file as a Python object (a dict of dicts)

'for' loop over the dict keys
    use dict key to get "inner" dict associated with that key
    retrieve values from "inner" dict and assemble into correct order for SQL table
    execute INSERT query with reordered values

commit to finalize the inserts
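
A sketch of the script follows. The "inner" dict key names used below are assumptions -- substitute the keys that actually appear in weather_newyork_dod.json, in the order the table expects:

import json
import sqlite3

conn = sqlite3.connect('session_2.db')
cursor = conn.cursor()

cursor.execute('DROP TABLE IF EXISTS weather_newyork')
cursor.execute('''CREATE TABLE weather_newyork
                  (date TEXT, mean_temp INT, precip FLOAT, events TEXT)''')

insert_query = 'INSERT INTO weather_newyork VALUES (?, ?, ?, ?)'

fh = open('weather_newyork_dod.json')
dod = json.load(fh)                        # a dict of dicts
fh.close()

for key in dod:
    inner = dod[key]                       # the "inner" dict for this key
    # illustrative key names -- use the keys actually present in the file
    date = inner['date']
    mean_temp = inner['mean_temp']
    precip = inner['precip']
    events = inner['events']
    if precip == 'T':
        precip = None                      # the value None, not the string 'None'
    cursor.execute(insert_query, (date, mean_temp, precip, events))

conn.commit()                              # finalize the inserts
conn.close()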
 
2.5 (this assignment is not required, but is highly recommended, so please do it if you can) SQL to JSON dict of dicts. Build a JSON dict of dicts (similar in form to weather_newyork_dod.json) based on the revenue table in this week's session_2.db database.

A dict of dicts must start with an empty dict, initialized before the loop begins. We will call this the "outer" dict, because we will be adding dicts to this dict. The source data is the database, so we will be SELECTing and looping through the rows of data. For each row of data we will want to build a complete "inner" dictionary that matches the row. For example, here is a sample "inner" dictionary from the first row of the data:

inner = { 'name':  "Haddad's",
          'state': 'PA',
          'cost':  239.95 }

Understand however that the above dict will be built from row data inside the database result set loop. In other words, as we loop through the database results, we build a dict based on the row values inside the loop, and then we add it to the "outer" dictionary using the company name as the key and the "inner" dict as the value.

When the loop is complete, you will dump the dict of dicts to file using json.dump(). The method call should also include an indent=4 parameter to write it in a human-readable form:
json.dump(outer_dict, wfh, indent=4)

(The above assumes your dict of dicts variable is named outer_dict and your open file for writing is a variable named wfh.)

connect to db
create cursor object

initialize an empty "outer" dict (will be built as a "dict of dicts")

execute the query to select all columns from the revenue table

'for' loop over database rows
    assign each column value to a variable:  name, state, cost

    build a new "inner" dict with key/value pairs
    matching the database row values

    add a new key/value pair to the "outer" dict:
    name as key and "inner" dict as value

open a new file for writing:  revenue.json

use the json.dump() method to write the dict of dicts to revenue.json
(use the indent=4 parameter of json.dump() to write the structure in a
human-readable format; otherwise it will be written as a block of text
without syntactic spaces)
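
A sketch of the full script, assuming the revenue table's columns are name, state and cost, in that order:

import json
import sqlite3

conn = sqlite3.connect('session_2.db')
cursor = conn.cursor()

outer_dict = {}                            # the "outer" dict, built up as a dict of dicts

cursor.execute('SELECT * FROM revenue')
for row in cursor:
    name, state, cost = row                # assumes the column order name, state, cost
    inner = {'name': name, 'state': state, 'cost': cost}
    outer_dict[name] = inner               # company name as the "outer" key

wfh = open('revenue.json', 'w')
json.dump(outer_dict, wfh, indent=4)       # indent makes the output human-readable
wfh.close()

conn.close()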
 
2.6 (extra credit / supplementary) Parse larger JSON file. Reading the ip_routes.json file into a Python object, provide code that can answer the following three questions:

How many keys are in the 'routes' dict of dicts, and in the 'routes' dict of dicts, what are the ip addresses that have a 'routeType' of 'iBGP'? Scrolling down to where the ip addresses start, you'll see that the 'routes' key points to a dict of dicts in which each key is an ip address and the value for each key is a dict containing information in key/value pairs. Note that each ip's dict has a 'routeType' key, and that the value is usually 'eBGP'. There are 3 ip addresses associated with dicts where this value is 'iBGP'. First, we need to 'drill down' to the dict of dicts associated with 'routes'. As you can see, this dict is nested in a series of other dicts (and one list), so you should use multiple subscripts to get to it. As your eye travels down the structure toward 'routes', ask the question "what key or index do I need to get to the next level?"

For example, to get to the 'result' list, you'll need to say:
obj['result']

(where obj is the object returned from json.load()).

To get to the first dict inside this list, you'll need to use the first index in the list, so:
obj['result'][0]

Continue on from there until you have retrieved the 'routes' dict. If you pass this dict to len() you should find that it has 1346 keys. Next, we need to loop through the key/value pairs in the dict associated with 'routes'. Each key is an ip address, and each value is a dict. Where the dict has a 'routeType' key with value 'iBGP', print the ip. You should see 3 of them.
Besides 'routes', what are the keys in the dict associated with 'default', and what are the values associated with those keys? As above, you'll 'drill down', this time not quite as far, to the dict associated with 'default'. Loop through this dict and print out the key and value for each, EXCEPT for 'routes'. Of course, if you print the value associated with 'routes', you'll get a lot of output! So we'll omit this key/value pair from the output.
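
Here is a sketch of code answering all three questions. The intermediate 'vrfs' subscript in the drill-down is an assumption -- follow whatever keys you actually see in ip_routes.json on the way down to 'default' and 'routes':

import json

fh = open('ip_routes.json')
obj = json.load(fh)
fh.close()

# drill down to the dict associated with 'default'; the 'vrfs' key is a guess --
# replace the subscripts with the keys you actually see in the file
default = obj['result'][0]['vrfs']['default']
routes = default['routes']                 # dict of dicts keyed by ip address

# question 1:  how many keys are in 'routes'?
print(len(routes))                         # should print 1346

# question 2:  which ip addresses have a 'routeType' of 'iBGP'?
for ip, info in routes.items():
    if info['routeType'] == 'iBGP':
        print(ip)                          # 3 addresses expected

# question 3:  the other keys (and values) in 'default', omitting 'routes'
for key, value in default.items():
    if key != 'routes':
        print(key, value)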

 
2.7 (extra credit / supplementary) ETL Library. Write a file called etllib.py with the following four functions for performing four transforms: SQL->CSV, SQL->JSON, CSV->SQL, JSON->SQL.

sql2csv(query, conn)

def sql2csv(query, conn):
    """
    query a database and return string data in CSV format

    Arguments:  query   string SQL query to select data
                conn    database connection object

    Return Value:  a string with CSV-formatted data

    (note:  to return CSV data as a string, the csv writer
     must be given a StringIO object instead of an
     open file object.  See the io module for
     details)

    """
The function must return a string; it should not write to a file. In order to use the csv module for formatting without writing to a file, we will pass to csv.writer a special "writable string" object of type StringIO, part of the io module. The StringIO object acts like a file object: you can call .write() on it, and you can also pass it to csv.writer().
import csv
import io

writestring = io.StringIO()
writer = csv.writer(writestring)

writer.writerow(['val1', 'val2', 'val3'])
writer.writerow(['val1', 'val2', 'val3'])

writestring.getvalue()   # val1,val2,val3\r\nval1,val2,val3

(The \r\n above represents a newline - csv by default writes in the "Excel" dialect, which uses Windows-style newlines - however, these will also work well on Mac and Linux systems.)

You'll also want to write the CSV headers before writing the data rows. Use the column head names retrieved from the database query through the cursor.description attribute:
c = conn.cursor()
c.execute(query)    # your SELECT query

headers = [ items[0] for items in c.description ]
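
Putting the StringIO and header pieces together, a sketch of the function might look like this:

import csv
import io

def sql2csv(query, conn):
    """ query a database and return string data in CSV format """
    cursor = conn.cursor()
    cursor.execute(query)

    headers = [ item[0] for item in cursor.description ]

    writestring = io.StringIO()            # "writable string" stands in for a file
    writer = csv.writer(writestring)

    writer.writerow(headers)               # header row first
    for row in cursor:
        writer.writerow(row)

    return writestring.getvalue()          # the accumulated CSV text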

sql2json(query, conn, format='lod', primary_key=None)

def sql2json(query, conn, format='lod', primary_key=None):
    """
    query a database and return a JSON string

    if format is 'lod', function will return a
    list of dicts

    if format is 'dod', function will return a dict
    of dicts with the designated primary_key as
    "outer" dict key

    Arguments:  query   string SQL query to select data
                conn    database connection object

                format (optional):
                        'lod' (list of dicts, the default) or
                        'dod' (dict of dicts)

                primary_key  (optional):
                        column value to use as the key
                        for a dict of dicts

                (note that if format == 'dod' then 'primary_key'
                 must be an existing column name;
                 if format == 'lod' then 'primary_key'
                 must be None -- use 'is None' or 'is not None'
                 to test)

    Return Value:  string in JSON format

    (note:  to return a JSON string rather than
     writing to a file, use the json.dumps()
     method, which returns a string, instead of
     the json.dump() method)

    """

This function may build a list of dicts (if the argument format='lod') or a dict of dicts (if format='dod'). In either case, an "inner" dict must be constructed where the dict's keys are the names of the column heads, and the dict's values are the values for each row.

After issuing the SELECT query but before looping through the result set rows, retrieve the table headers from the cursor object to be used as keys in the "inner" dict:
c = conn.cursor()
c.execute(query)    # your SELECT query

headers = [ item[0] for item in c.description ]

As you loop through the result set, you'll have a tuple of values for each row. Here is where you will build the "inner" dict for adding to the dict of dicts or list of dicts.

You can very quickly and easily build the "inner" dict by "zipping" the header list and result set row tuple and then passing the resulting structure to dict():
print(headers)                                   # ['name', 'state']
print(result_row)                                # ('Joe', 'CA')

list_of_tuples = zip(headers, result_row)        # pairs:  ('name', 'Joe'), ('state', 'CA')
inner_dict = dict(list_of_tuples)                # {'name': 'Joe', 'state': 'CA'}

inner_dict = dict(zip(headers, result_row))      # same as above 2 lines

The above assumes that headers is your result set column heads retrieved from the cursor.description attribute, and result_row is a tuple of values retrieved in each iteration of 'for' looping over the result set. Please study the above example before using it - it's an extremely useful idiom for "marrying" lists of keys to lists of values in a dictionary, and it will serve you in the future as well.

Again, the function returns a JSON string - it does not write to a JSON file. To have JSON output a string, use the json.dumps() method:
jsonstr = json.dumps(dod)
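
A sketch of the function, using the zip() idiom above; the single validation check shown is a minimal stand-in for the fuller argument checking described in the docstring:

import json

def sql2json(query, conn, format='lod', primary_key=None):
    """ query a database and return a JSON string ('lod' or 'dod' format) """
    if format == 'dod' and primary_key is None:
        raise ValueError("format 'dod' requires a primary_key column name")

    cursor = conn.cursor()
    cursor.execute(query)

    headers = [ item[0] for item in cursor.description ]

    container = {} if format == 'dod' else []    # dict of dicts or list of dicts

    for row in cursor:
        inner = dict(zip(headers, row))    # "marry" column names to row values
        if format == 'dod':
            container[inner[primary_key]] = inner
        else:
            container.append(inner)

    return json.dumps(container)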

csv2sql(filename, conn, table)

def csv2sql(filename, conn, table):
    """
    insert a csv file into a database table

    Arguments:  filename   CSV filename to read
                conn       database connection object
                table      table to insert to

    Return Value:  None (writes to database)
    """

This assignment is more straightforward: loop through the CSV file row by row, and for each row of data execute an INSERT query against the database table.

As with all database inserts, you should use parameter arguments. Here is a sample query with four parameter arguments:
query = 'INSERT INTO table VALUES (?, ?, ?, ?)'

Note this form of INSERT INTO does not use column names. This will work as long as values are inserted in the same order as found in the file.

In order to build an appropriate INSERT query, the number of question marks you use must match the number of columns in the CSV file (and SQL table; these are assumed to be the same). An easy way to build a string to be used in the INSERT query is to use the string .join() method with the length of the header row:
headers = ['name', 'state']
qmarks = '?' * len(headers)           # '??'
parameter_tokens = ', '.join(qmarks)  # '?, ?'
print(parameter_tokens)               # '?, ?'

', '.join('??') treats the '??' string as a sequence of characters, "joining" them with ', '. The net effect is to insert the ', ' string between each character, in the same way that .join() inserts a string in between each list item in the usual usage. Don't forget to conn.commit() your inserts, or they may not show up in the database at all.
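
A sketch of the function; note that the table name is interpolated directly into the query string, which assumes it comes from trusted code:

import csv

def csv2sql(filename, conn, table):
    """ insert a csv file into a database table """
    cursor = conn.cursor()

    fh = open(filename, newline='')
    reader = csv.reader(fh)
    headers = next(reader)                 # header row gives the column count

    parameter_tokens = ', '.join('?' * len(headers))
    query = f'INSERT INTO {table} VALUES ({parameter_tokens})'   # table name assumed trusted

    for row in reader:
        cursor.execute(query, row)         # a list works as the parameter sequence

    conn.commit()                          # without this the inserts may not persist
    fh.close()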

json2sql(filename, conn, table)

def json2sql(filename, conn, table):
    """
    insert JSON data into a database

    Arguments:  filename   JSON file to read (assumes dict of dicts)
                           also assumes that "inner" dicts all have
                           identical keys
                conn       database connection object
                table      name of table to write to

    Return Value:  None (writes to database)
    """
As with all database inserts, you should use parameter arguments. Here is a sample query with four parameter arguments:
query = 'INSERT INTO table VALUES (?, ?, ?, ?)'

Note this form of INSERT INTO does not use column names. This will work as long as values are inserted in the same order as found in the file.

In order to build an appropriate INSERT query, the number of question marks you use must match the number of keys in the JSON "inner" dict (and the number of columns in the SQL table; these are assumed to be the same). Build the parameter-token string with ', '.join() exactly as shown for csv2sql above. In order to figure out how many keys are in an "inner" dict, I suggest you access the first "inner" dict directly rather than by looping.

You can get the first key in the dict of dicts by passing the dict's keys to list() and subscripting to access the first key:
first_json_key = list(dod.keys())[0]

Where dod is your dict of dicts read from json.

You can then access the first "inner" dict in the dictionary by using this key in the dict:
first_inner_dict = dod[first_json_key]

Finally, use the length of the "inner" dict to build a string of question marks to be used in the INSERT query:
parameter_tokens = ', '.join('?' * len(first_inner_dict))   # '?, ?, ?'

If each "inner" dict has 3 keys, the above code builds the string shown. (See the discussion of .join() above to understand the above code.)

As you loop through the dict of dicts, you'll be inserting the values from each dict into the database. If we can assume that the dict key/value pairs are in the same order as the database table columns (which we are doing for simplicity), we can simply retrieve the values from the "inner" dict in a list, and then use them in the execution of the INSERT query:
values = list(inner_dict.values())

The above assumes inner_dict is each "inner" dict in the dict of dicts. Don't forget to conn.commit() your inserts, or they may not show up in the database at all.
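
Finally, a sketch of the function combining the pieces above. As noted, it assumes the "inner" dict key order matches the table's column order, and it interpolates the (trusted) table name into the query string:

import json

def json2sql(filename, conn, table):
    """ insert JSON data (a dict of dicts) into a database table """
    cursor = conn.cursor()

    fh = open(filename)
    dod = json.load(fh)
    fh.close()

    # use the first "inner" dict to size the parameter-token string
    first_json_key = list(dod.keys())[0]
    first_inner_dict = dod[first_json_key]
    parameter_tokens = ', '.join('?' * len(first_inner_dict))
    query = f'INSERT INTO {table} VALUES ({parameter_tokens})'   # table name assumed trusted

    for key in dod:
        values = list(dod[key].values())   # assumes key order matches the table columns
        cursor.execute(query, values)

    conn.commit()                          # without this the inserts may not persist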

 