Python 3

home

SQL Part 3: Primary Key, JOIN, GROUP BY, ORDER BY

Reminder: sqlite3 client column formatting

Use these sqlite3 commands to format your output readably.

At the start of your session, issue the following two commands -- these will format your sqlite3 output so it is clearer, and add columns headers.

sqlite> .mode column
sqlite> .headers on

Now output is clearly lined up with column heads displayed:

sqlite> SELECT * FROM revenue;

    # company     state       price
    # ----------  ----------  ----------
    # Haddad's    PA          239.5
    # Westfield   NJ          53.9
    # The Store   NJ          211.5
    # Hipster's   NY          11.98
    # Dothraki F  NY          5.98
    # Awful's     PA          23.95
    # The Clothi  NY          115.2

However, note that each column is only 10 characters wide. It is possible to change these widths although not usually necessary.

Primary Key (PK) in a Table

The PK is defined in the CREATE TABLE definition.

A primary key is a unique value that can be used to identify a table row. These are usually id numbers, such as a product id, county id, domain ip, etc. Since it identifies a unique row in a table, the primary key in a database cannot be duplicated -- an error will occur if this is attempted.

Here's a table description in SQLite for a table that has a "instructor_id" primary key:

CREATE TABLE instructors ( instructor_id INT PRIMARY KEY,
                           password TEXT,
                           first_name TEXT,
                           last_name TEXT   );

JOINing Tables on a Primary Key

Two tables may have info keyed to the same primary key -- these can be joined into one table.

Relational database designs attempt to separate data into individual tables, in order to avoid repetition. For example, consider one table that holds data for instructors at a school (in which one instructor appears per row) and another that holds records of a instructor's teaching a class (in which the same instructor may appear in multiple rows).

Here is a CREATE TABLE description for tables instructors and instructor_classes. instructors contains:

sqlite3> .schema instructors
CREATE TABLE instructors ( instructor_id INT PRIMARY KEY,
                           password TEXT,
                           first_name TEXT,
                           last_name TEXT   );

sqlite3> .schema instructor_classes
CREATE TABLE instructor_classes ( instructor_id INT,
                                  class_name TEXT,
                                  day TEXT );

Select all rows from both tables:

sqlite3> SELECT * from instructors
instructor_id  password    first_name  last_name
-------------  ----------  ----------  ----------
1              pass1       David       Blaikie
2              pass2       Joe         Wilson
3              xxyx        Jenny       Warner
4              yyyy        Xavier      Yellen

sqlite> SELECT * from instructor_classes
instructor_id  class_name    day
-------------  ------------  ----------
1              Intro Python  Thursday
1              Advanced Pyt  Monday
2              php           Monday
2              js            Tuesday
3              sql           Wednesday
3              mongodb       Thursday
99             Golang        Saturday

Why is instructor_classes data separated from instructors data? If we combined all of this data into one table, there would be repetition -- we'd see the instructor's name repeated on all the rows that indicate the instructor's class assignments. So it makes sense to separate the data that has a "one-to-one" relationship of instructors to the data for each instructor (as in the instructors table) from the data that has a "many-to-one" relationship of the instructor to the data for each instructor (as in the instructor_classes table). But there are times where we will want to see all of this data shown together in a single result set -- we may see repetition, but we won't be storing repetition. We can create these combined result sets using database joins.

LEFT JOIN

all rows from "left" table, and matching rows in right table

A left join includes primary keys from the "left" table (this means the table mentioned in the FROM statement) and will include only those rows in right table that share those same keys.

sqlite3> SELECT * FROM instructors LEFT JOIN instructor_classes
         on instructors.instructor_id = instructor_classes.instructor_id;

instructor_id  password    first_name  last_name   instructor_id  class_name       day
-------------  ----------  ----------  ----------  -------------  ---------------  ----------
1              pass1       David       Blaikie     1              Advanced Python  Monday
1              pass1       David       Blaikie     1              Intro Python     Thursday
2              pass2       Joe         Wilson      2              js               Tuesday
2              pass2       Joe         Wilson      2              php              Monday
3              xxyx        Jenny       Warner      3              mongodb          Thursday
3              xxyx        Jenny       Warner      3              sql              Wednesday
4              yyyy        Xavier      Yellen

Note the missing data on the right half of the last line. The right table instructor_classes had no data for instructor id 4.

RIGHT JOIN

all rows from the "right" table, and matching rows in the left table

A right join includes primary keys from the "right" table (this means the table mentioned in the JOIN clause) and will include only those rose in the left table that share the same keys as those in the right.

Unfortunately, SQLite does not support RIGHT JOIN (although many other databases do). The workaround is to use a LEFT JOIN and reverse the table names.

sqlite3> SELECT * FROM instructor_classes LEFT JOIN instructors ON instructors.instructor_id = instructor_classes.instructor_id;

instructor_id  class_name    day         instructor_id  password    first_name  last_name
-------------  ------------  ----------  -------------  ----------  ----------  ----------
1              Intro Python  Thursday    1              pass1       David       Blaikie
1              Advanced Pyt  Monday      1              pass1       David       Blaikie
2              php           Monday      2              pass2       Joe         Wilson
2              js            Tuesday     2              pass2       Joe         Wilson
3              sql           Wednesday   3              xxyx        Jenny       Warner
3              mongodb       Thursday    3              xxyx        Jenny       Warner
99             Golang        Saturday

Now only rows that appear in instructor_classes appear in this table, and data not found in instructors is missing (In this case, Golang has no instructor and it is given the default id 99).

INNER JOIN and OUTER JOIN

Select only PKs common to both tables, or all PKs for all tables

INNER JOIN: rows common to both tables

An inner join includes only those rows that have primary key values that are common to both tables:

sqlite3> SELECT * from instructor_classes INNER JOIN instructors ON instructors.instructor_id = instructor_classes.instructor_id;
instructor_id  class_name    day         instructor_id  password    first_name  last_name
-------------  ------------  ----------  -------------  ----------  ----------  ----------
1              Intro Python  Thursday    1              pass1       David       Blaikie
1              Advanced Pyt  Monday      1              pass1       David       Blaikie
2              php           Monday      2              pass2       Joe         Wilson
2              js            Tuesday     2              pass2       Joe         Wilson
3              sql           Wednesday   3              xxyx        Jenny       Warner
3              mongodb       Thursday    3              xxyx        Jenny       Warner

Rows are joined where both instructors and instructor_classes have data. OUTER JOIN: all rows from both tables

An outer join includes all rows from both tables, regardless of whether a PK id appears in the other table. Here's what the query would be if sqlite3 supported outer joins:

SELECT * from instructor_classes OUTER JOIN instructors ON instructors.instructor_id = instructor_classes.instructor_id;

unfortunately, OUTER JOIN is not currently supported in sqlite3. In these cases it's probably best to use another approach, i.e. built-in Python or pandas merge() (to come).

Aggregating data with GROUP BY

"Aggregation" means counting, summing or otherwise summarizing multiple values based on a common key.

Consider summing up a count of voters by their political affiliation (2m Democrats, 2m Republicans, .3m Independents), a sum of revenue of companies by their sector (manufacturing, services, etc.), or an average GPA by household income. All of these require taking into account the individual values of multiple rows and compiling some sort of summary value based on those values.

Here is a sample that we'll play with:

sqlite3> SELECT date, name, rev FROM companyrev;

date        name         rev
----------  -----------  ----------
2019-01-03  Alpha Corp.  10
2019-01-05  Alpha Corp.  20
2019-01-03  Beta Corp.   5
2019-01-07  Beta Corp.   7
2019-01-09  Beta Corp.   3

If we wish to sum up values by company, we can say it easily:

sqlite3> SELECT name, sum(rev) FROM companyrev GROUP BY name;

name         sum(rev)
-----------  ----------
Alpha Corp.  30
Beta Corp.   15

If we wish to count the number entries for each company, we can say it just as easily:

sqlite3> SELECT name, count(name) FROM companyrev GROUP BY name;

name         count(name)
-----------  -----------
Alpha Corp.  2
Beta Corp.   3

Sorting a Result Set with ORDER BY

This is SQL's way of sorting results.

The ORDER BY clause indicates a single column, or multiple columns, by which we should order our results:

sqlite3> SELECT name, rev FROM companyrev ORDER BY rev;

name        rev
----------  ----------
Beta Corp.  3
Beta Corp.  5
Beta Corp.  7
Alpha Corp  10
Alpha Corp  20

[pr]