Advanced Python
Project Discussion, Session 2
Please note: notations (as used in the Introduction to Python course) are not required.

IMPORTANT note on using the sqlite client: the sqlite client is convenient for viewing table structure and other tasks, but it is not required. When using it, please keep in mind that your Terminal / Command Prompt session must be in the same directory as your database file, or you will create a new, empty file with the same name as the one you attempted to open. Please see the Challenge Exercises for the steps needed to move your command line session to the proper data folder for these projects.
2.1 | SQL to CSV: write a function sql_to_csv(db_filename, table, csv_filename) that reads from a table in an sqlite database and writes the data to a CSV file.

Use the cursor's .description attribute to retrieve the column names and build them into the first row of the CSV file. The result should be a .csv file with the same column heads and data as the database table. We will use session_2.db as the database, read from a table named revenue, and write to revenue_from_db.csv. Please make sure to close the filehandle as well as the database connection.
def sql_to_csv(db_filename, table, csv_filename):
    connect to database filename, returning a connection object
    call .cursor() on the connection object, returning a cursor object
    build sql query using the table name
    execute sql query on the cursor object
    build a list of headers from the .description attribute
    open csv file for writing
    pass the file to csv.writer
    write header row to csv (use writer.writerow())
    for row in cursor:
        write row to csv using .writerow()
    close csv file
    close database connection
It shouldn't be necessary to develop a list of lists for writing with csv.writerows(); you can simply write to the CSV as you iterate through the database rows. In fact, since database result rows are tuples, and a tuple works the same way as a list here, you can write each row to the CSV file with .writerow().
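Here is one possible sketch of a working implementation (a minimal version; error handling omitted):

import csv
import sqlite3

def sql_to_csv(db_filename, table, csv_filename):
    conn = sqlite3.connect(db_filename)        # connection object
    cursor = conn.cursor()                     # cursor object
    cursor.execute(f'SELECT * FROM {table}')   # query built from the table name
    headers = [item[0] for item in cursor.description]   # column names
    fh = open(csv_filename, 'w', newline='')
    writer = csv.writer(fh)
    writer.writerow(headers)                   # header row first
    for row in cursor:                         # each result row is a tuple
        writer.writerow(row)                   # tuples work like lists here
    fh.close()
    conn.close()

sql_to_csv('session_2.db', 'revenue', 'revenue_from_db.csv')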
2.2 | CREATE TABLE / DROP TABLE: write a script that creates a new table called weather_newyork in the session_2.db database with the following columns: date (should be a TEXT column), mean_temp (INT), precip (FLOAT), events (TEXT).
connect to db
create cursor object
execute sql DROP TABLE IF EXISTS query
execute sql CREATE TABLE query
commit the changes to the database
print all tables in the database (this is for testing only -- see code below)
Note that this script won't do anything except remove and then create the table. You can verify that the CREATE TABLE statement has succeeded with the following code:
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = [ tup[0] for tup in cursor.fetchall() ]
print(f'tables: {tables}')
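Assembled into a full script, a minimal sketch might look like this (assuming your session is in the same directory as session_2.db):

import sqlite3

conn = sqlite3.connect('session_2.db')
cursor = conn.cursor()
cursor.execute('DROP TABLE IF EXISTS weather_newyork')
cursor.execute('CREATE TABLE weather_newyork '
               '(date TEXT, mean_temp INT, precip FLOAT, events TEXT)')
conn.commit()

# testing only: confirm the table now exists
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = [tup[0] for tup in cursor.fetchall()]
print(f'tables: {tables}')
conn.close()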
2.3 | CSV to SQL: Starting with the DROP TABLE and CREATE TABLE code from the previous solution, write a script that reads from weather_newyork.csv (using the csv module), selects the date, mean_temp, precip and events columns, and inserts them row by row into the table you created, weather_newyork.
Please note that the "precip" column has a "T" value instead of a numeric value in some rows. Please account for this by inserting the Python value None (not the string 'None', but the unquoted value None) in place of the T.

Of course, this is the SQL to CSV assignment in reverse. Start by copying in the DROP TABLE IF EXISTS and CREATE TABLE code from the prior assignment, and test it to see that it works (you'll be able to tell that it worked if the script runs without error; if there were any problem you'd see an exception). Next, compose your INSERT query string with parameter arguments (i.e., with ? tokens in place of values). You'll execute this query over and over with different values, so the query string should be assigned before the loop over the CSV data begins. Then open the CSV file using the csv module, next() over the header, and loop through the CSV data rows, executing the INSERT query with each row's data. Don't forget to commit() at the end in order to see the rows written to the table.
connect to db
create cursor object
execute sql DROP TABLE IF EXISTS query
execute sql CREATE TABLE query
initialize an INSERT query (use ? parameter arguments)
open CSV file
pass filehandle to csv reader
next() over the header row
'for' loop over the reader
    assign the 4 desired column values to variables: date, temp, precip, events
    if the precip value is 'T', reassign as None (no quotes)
    execute the INSERT query with this row's data
commit to finalize the inserts
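A possible sketch, assuming the CSV's header row includes the column names date, mean_temp, precip, and events (these header names are an assumption here; check the file's actual header):

import csv
import sqlite3

conn = sqlite3.connect('session_2.db')
cursor = conn.cursor()
cursor.execute('DROP TABLE IF EXISTS weather_newyork')
cursor.execute('CREATE TABLE weather_newyork '
               '(date TEXT, mean_temp INT, precip FLOAT, events TEXT)')

insert_query = 'INSERT INTO weather_newyork VALUES (?, ?, ?, ?)'

fh = open('weather_newyork.csv')
reader = csv.reader(fh)
header = next(reader)                        # capture (and skip) the header row

# look up column positions by name rather than assuming fixed indexes
date_i, temp_i = header.index('date'), header.index('mean_temp')
precip_i, events_i = header.index('precip'), header.index('events')

for row in reader:
    date, temp, precip, events = row[date_i], row[temp_i], row[precip_i], row[events_i]
    if precip == 'T':                        # replace 'T' with None (NULL in the table)
        precip = None
    cursor.execute(insert_query, (date, temp, precip, events))

conn.commit()                                # finalize the inserts
fh.close()
conn.close()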
Finally, you'll want to issue a SELECT query and loop over the results to confirm that the rows have been inserted. Ideally, you would use the sqlite3 client to view the rows, since this gives independent confirmation. You must confirm that you have replaced T 'precip' values with None. The first row with a 'precip' value of T is on 1/9/16; at that row in the table, the 'precip' value should be empty or NULL. (In the sqlite3 client it will show as an empty cell; selecting using Python should show None.)
2.4 | JSON to SQL
Start with the same DROP TABLE / CREATE TABLE code you used in the previous assignment to drop and then create the weather_newyork table. Also as in the previous assignment, initialize an INSERT query with parameter arguments. Open weather_newyork_dod.json and use the json module to load the file as a Python dict of dicts. Loop through the dict of dicts (where each key is a string and each value is an "inner" dict of values), and insert each "inner" dict as a row in the database by executing the INSERT query with the "inner" dict's values. Use subscripts to grab each dict value for assembly into the table. Please note that the "precip" key sometimes has a "T" value instead of a numeric value; account for this by inserting the Python value None (not the string 'None', but the unquoted value None) in place of the T. Don't forget to commit() the database connection at the end in order to see the rows written to the table.
connect to db
create cursor object
execute sql DROP TABLE IF EXISTS query
execute sql CREATE TABLE query
initialize an INSERT query (use ? parameter arguments)
open JSON file
load JSON file as a Python object (a dict of dicts)
'for' loop over the dict keys
    use dict key to get the "inner" dict associated with that key
    retrieve values from the "inner" dict and assemble into correct order for the SQL table
    execute INSERT query with reordered values
commit to finalize the inserts
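A possible sketch, assuming each outer key in weather_newyork_dod.json is a date string and each "inner" dict holds its row's values under the keys 'mean_temp', 'precip', and 'events' (these key names are assumptions; inspect the file to confirm its actual structure):

import json
import sqlite3

conn = sqlite3.connect('session_2.db')
cursor = conn.cursor()
cursor.execute('DROP TABLE IF EXISTS weather_newyork')
cursor.execute('CREATE TABLE weather_newyork '
               '(date TEXT, mean_temp INT, precip FLOAT, events TEXT)')

insert_query = 'INSERT INTO weather_newyork VALUES (?, ?, ?, ?)'

with open('weather_newyork_dod.json') as fh:
    dod = json.load(fh)                      # dict of dicts

for date in dod:                             # assumed: each outer key is a date
    inner = dod[date]                        # "inner" dict for this row
    precip = inner['precip']
    if precip == 'T':                        # replace 'T' with None (NULL in the table)
        precip = None
    cursor.execute(insert_query, (date, inner['mean_temp'], precip, inner['events']))

conn.commit()                                # finalize the inserts
conn.close()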
2.5 | (this assignment is not required, but is highly recommended, so please do it if you can) SQL to JSON dict of dicts. Build a JSON dict of dicts (similar in form to weather_newyork_dod.json) based on the revenue table in this week's session_2.db database.
A dict of dicts must start with an empty dict, initialized before the loop begins. We will call this the "outer" dict, because we will be adding dicts to it. The source data is the database, so we will be SELECTing and looping through the rows of data. For each row of data we will want to build a complete "inner" dictionary that matches the row. For example, here is a sample "inner" dictionary from the first row of the data:
inner = { 'name': "Haddad's",
'state': 'PA',
'cost': 239.95 }
Understand, however, that the above dict will be built from row data inside the database result set loop. In other words, as we loop through the database results, we build a dict based on the row values inside the loop, and then add it to the "outer" dictionary, using the company name as the key and the "inner" dict as the value.
When the loop is complete, you will dump the dict of dicts to a file using json.dump(). The method call should also include an indent=4 parameter to write it in a human-readable form:
json.dump(outer_dict, wfh, indent=4)
(The above assumes your dict of dicts variable is named outer_dict and your open file for writing is a variable named wfh.)
connect to db
create cursor object
initialize an empty "outer" dict (will be built as a "dict of dicts")
execute the query to select all columns from the revenue table
'for' loop over database rows
    assign each column value to a variable: name, state, cost
    build a new "inner" dict with key/value pairs matching the database row values
    add a new key/value pair to the "outer" dict: name as key and "inner" dict as value
open a new file for writing: revenue.json
use the json.dump() method to write the dict of dicts to revenue.json (use the indent=4 parameter of json.dump() to write the structure in a human-readable format; otherwise it will be written as a block of text without syntactic spaces)
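A possible sketch, assuming the revenue table's columns are name, state, and cost, in that order (the sample "inner" dict above suggests these names; confirm against the actual table):

import json
import sqlite3

conn = sqlite3.connect('session_2.db')
cursor = conn.cursor()

outer_dict = {}                              # the "outer" dict of dicts

cursor.execute('SELECT * FROM revenue')
for name, state, cost in cursor:             # one variable per column value
    inner = {'name': name, 'state': state, 'cost': cost}
    outer_dict[name] = inner                 # company name as the "outer" key

wfh = open('revenue.json', 'w')
json.dump(outer_dict, wfh, indent=4)         # indent makes it human-readable
wfh.close()
conn.close()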
2.6 | (extra credit / supplementary) Parse larger JSON file. Reading the ip_routes.json file into a Python object, provide code that can answer the following questions:
1. How many keys are in the 'routes' dict of dicts?
2. In the 'routes' dict of dicts, what are the ip addresses that have a 'routeType' of 'iBGP'?

Looking down to where the ip addresses start, and scrolling further, you'll see that the 'routes' key points to a dict of dicts in which each key is an ip address, and the value for each key is a dict of information in key/value pairs. Note that each ip dict has a 'routeType' key, and that its value is usually 'eBGP'. There are 3 ip addresses associated with dicts where this value is 'iBGP'.

First, we need to "drill down" to the dict of dicts associated with 'routes'. As you can see, this is a dict nested in a series of other dicts (and one list). You should use multiple subscripts to get to this dict. As your eye travels down the structure toward 'routes', ask the question: "what key or index do I need to get to the next level?"
For example, to get to the 'result' list, you'll need to say:
obj['result']
(where obj is the object returned from json.load()).
To get to the first dict inside this list, you'll need to use the first index in the list, so:
obj['result'][0]
Continue on from there until you have retrieved the 'routes' dict. If you pass this dict to len(), you should find that it has 1346 keys.

Next, we need to loop through the key/value pairs in the dict associated with 'routes'. Each key is an ip address, and each value is a dict. Wherever the dict has a 'routeType' key with the value 'iBGP', print the ip. You should see 3 of them.
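A sketch of the overall approach (the ['somekey'] subscript below is a hypothetical placeholder for however many keys and indexes the real file requires; drill down in the actual file to find them):

import json

with open('ip_routes.json') as fh:
    obj = json.load(fh)

# hypothetical path -- substitute the real subscripts you find in the file
routes = obj['result'][0]['somekey']['routes']

print(len(routes))                          # question 1: should be 1346

for ip in routes:                           # question 2: ips with 'iBGP'
    if routes[ip]['routeType'] == 'iBGP':
        print(ip)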
2.7 | (extra credit / supplementary) ETL Library. Write a file called etllib.py with the following four functions for performing four transforms: SQL->CSV, SQL->JSON, CSV->SQL, JSON->SQL.
sql2csv(query, conn)
def sql2csv(query, conn):
"""
query a database and return string data in CSV format
Arguments: query string SQL query to select data
conn database connection object
Return Value: a string with CSV-formatted data
(note: to return CSV data as a string, the csv writer
must be given a StringIO object instead of an
open file object. See the io module for
details)
"""
The function must return a string; it should not write to a file. In order to use the csv module for formatting without writing to a file, we will pass to csv.writer a special "writable string" object of type StringIO, part of the io module. The StringIO object acts like a file object: you can call .write() on it, and you can also pass it to csv.writer().
import csv
import io
writestring = io.StringIO()
writer = csv.writer(writestring)
writer.writerow(['val1', 'val2', 'val3'])
writer.writerow(['val1', 'val2', 'val3'])
csv_data = writestring.getvalue()   # 'val1,val2,val3\r\nval1,val2,val3\r\n'
(The \r\n above represents a newline; csv by default writes in the "Excel" dialect, which uses Windows-style newlines. These will also work well on Mac and Linux systems.)
You'll also want to write the CSV file's headers to the file before writing data. Use the column head names retrieved from the database query through the cursor.description attribute:
c = conn.cursor()
c.execute(query) # your SELECT query
headers = [ item[0] for item in c.description ]
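Putting the StringIO technique and the header retrieval together, one possible sketch of sql2csv:

import csv
import io

def sql2csv(query, conn):
    """query a database and return string data in CSV format"""
    cursor = conn.cursor()
    cursor.execute(query)                    # the caller's SELECT query
    headers = [item[0] for item in cursor.description]
    writestring = io.StringIO()              # "writable string" stands in for a file
    writer = csv.writer(writestring)
    writer.writerow(headers)                 # column heads as the first CSV row
    for row in cursor:
        writer.writerow(row)
    return writestring.getvalue()            # the CSV data as a string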
sql2json(query, conn, format='lod', primary_key=None)
def sql2json(query, conn, format='lod', primary_key=None):
"""
query a database and return a JSON string
if format is 'lod', function will return a
list of dicts
if format is 'dod', function will return a dict
of dicts with the designated primary_key as
"outer" dict key
Arguments: query string SQL query to select data
conn database connection object
format (optional):
'lod' (list of dicts, the default) or
'dod' (dict of dicts)
primary_key (optional):
column value to use as the key
for a dict of dicts
(note that if format == 'dod' then 'primary_key'
must be an existing column name;
if format == 'lod' then 'primary_key'
must be None -- use 'is None' or 'is not None'
to test)
Return Value: string in JSON format
(note: to return a JSON string rather than
writing to a file, use the json.dumps()
method, which returns a string, instead of
the json.dump() method)
"""
This function may build a list of dicts (if the argument format='lod', the default) or a dict of dicts (if format='dod'). In either case, an "inner" dict must be constructed whose keys are the names of the column heads and whose values are the values for each row.
After issuing the SELECT query but before looping through the result set rows, retrieve the table headers from the cursor object to be used as keys in the "inner" dict:
c = conn.cursor()
c.execute(query) # your SELECT query
headers = [ item[0] for item in c.description ]
As you loop through the result set, you'll have a tuple of values for each row. This is where you build the "inner" dict for adding to the dict of dicts or list of dicts.
You can very quickly and easily build the "inner" dict by "zipping" the header list and the result set row tuple, then passing the resulting structure to dict():
print(headers)       # ['name', 'state']
print(result_row)    # ('Joe', 'CA')
pairs = zip(headers, result_row)    # zip object yielding ('name', 'Joe'), ('state', 'CA')
inner_dict = dict(pairs)            # {'name': 'Joe', 'state': 'CA'}
inner_dict = dict(zip(headers, result_row))   # same as the above 2 lines
The above assumes that headers holds your result set column heads retrieved from the cursor.description attribute, and result_row is the tuple of values retrieved in each iteration of the 'for' loop over the result set. Please study the above example before using it; it's an extremely useful idiom for "marrying" a list of keys to a list of values in a dictionary, and it will serve you in the future as well.
Again, the function returns a JSON string - it does not write to a JSON file. To have JSON output a string, use the json.dumps() method:
jsonstr = json.dumps(dod)
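Combining these pieces, a possible sketch of sql2json (the 'dod' branch assumes primary_key names one of the selected columns, per the docstring):

import json

def sql2json(query, conn, format='lod', primary_key=None):
    """query a database and return a JSON string (list of dicts or dict of dicts)"""
    cursor = conn.cursor()
    cursor.execute(query)
    headers = [item[0] for item in cursor.description]
    if format == 'lod':                      # list of dicts (the default)
        result = [dict(zip(headers, row)) for row in cursor]
    else:                                    # 'dod': dict of dicts keyed on primary_key
        result = {}
        for row in cursor:
            inner = dict(zip(headers, row))
            result[inner[primary_key]] = inner
    return json.dumps(result)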
csv2sql(filename, conn, table)
def csv2sql(filename, conn, table):
"""
insert a csv file into a database table
Arguments: filename CSV filename to read
conn database connection object
table table to insert to
Return Value: None (writes to database)
"""
This more straightforward assignment asks you to simply loop through a CSV file row by row and, for each row of data in the CSV file, execute an INSERT query against the database table.
As with all database inserts, you should use parameter arguments. Here is a sample query with four parameter arguments:
query = 'INSERT INTO table VALUES (?, ?, ?, ?)'
Note that this form of INSERT INTO does not use column names. This will work as long as values are inserted in the same order as they are found in the file.
In order to build an appropriate INSERT query, the number of question marks you use must match the number of columns in the CSV file (and SQL table; these are assumed to be the same). An easy way to build a string to be used in the INSERT query is to use the string .join() method with the length of the header row:
headers = ['name', 'state']
qmarks = '?' * len(headers) # '??'
parameter_tokens = ', '.join(qmarks) # '?, ?'
print(parameter_tokens) # '?, ?'
', '.join('??') treats the '??' string as a sequence of characters, "joining" them with ', '. The net effect is to insert the ', ' string between each character, in the same way that .join() inserts a string between each list item in the usual usage. Don't forget to conn.commit() your inserts, or they won't show up in the database.
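A possible sketch of csv2sql, assuming the CSV's first row is a header whose column count matches the table:

import csv

def csv2sql(filename, conn, table):
    """insert a csv file into a database table"""
    cursor = conn.cursor()
    fh = open(filename)
    reader = csv.reader(fh)
    headers = next(reader)                     # capture (and skip) the header row
    tokens = ', '.join('?' * len(headers))     # one ? token per column
    query = f'INSERT INTO {table} VALUES ({tokens})'
    for row in reader:
        cursor.execute(query, row)
    conn.commit()                              # finalize the inserts
    fh.close()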
json2sql(filename, conn, table)
def json2sql(filename, conn, table):
"""
insert JSON data into a database
Arguments: filename JSON file to read (assumes dict of dicts)
also assumes that "inner" dicts all have
identical keys
conn database connection object
table name of table to write to
Return Value: None (writes to database)
"""
As with csv2sql, you should use parameter arguments, and the number of question marks in your INSERT query must match the number of columns — here, the number of keys in the JSON "inner" dict (and SQL table; these are assumed to be the same). Build the parameter token string with the same .join() technique shown above. In order to figure out how many keys are in an "inner" dict, I suggest you access the first "inner" dict directly rather than by looping.
You can get the first key in the dict of dicts by passing the dict's keys to list() and subscripting to access the first element:
first_json_key = list(dod.keys())[0]
(where dod is your dict of dicts read from the JSON file).
You can then access the first "inner" dict in the dictionary by using this key in the dict:
first_inner_dict = dod[first_json_key]
Finally, use the length of the "inner" dict to build a string of question marks to be used in the INSERT query:
parameter_tokens = ', '.join('?' * len(first_inner_dict)) # '?, ?, ?'
If each "inner" dict has 3 keys, the above code builds the string shown. (See the discussion of .join() above to understand this code.)
As you loop through the dict of dicts, you'll be inserting the values from each dict into the database. If we can assume that the dict key/value pairs are in the same order as the database table columns (which we are doing for simplicity), we can simply retrieve the values from the "inner" dict in a list, and then use them in the execution of the INSERT query:
values = list(inner_dict.values())
The above assumes inner_dict is each "inner" dict in the dict of dicts. Don't forget to conn.commit() your inserts, or they won't show up in the database.
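Pulling these pieces together, a possible sketch of json2sql (assuming, as the docstring does, that every "inner" dict has identical keys in table-column order):

import json

def json2sql(filename, conn, table):
    """insert JSON data (a dict of dicts) into a database table"""
    cursor = conn.cursor()
    with open(filename) as fh:
        dod = json.load(fh)
    first_json_key = list(dod.keys())[0]            # first "outer" key
    first_inner_dict = dod[first_json_key]          # first "inner" dict
    tokens = ', '.join('?' * len(first_inner_dict)) # one ? token per column
    query = f'INSERT INTO {table} VALUES ({tokens})'
    for key in dod:
        values = list(dod[key].values())            # assumes table-column order
        cursor.execute(query, values)
    conn.commit()                                   # finalize the inserts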