import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://bob:1234@localhost:5432/dvdrental')
engine.execution_options(isolation_level="AUTOCOMMIT")
%load_ext sql
%sql $engine.url
'Connected: bob@dvdrental'

SELF JOIN#

There may be a situation where we may want to combine rows in a table with other rows in the same table, by using a join statement. To join a table to itself, we must use a table alias in order to tell apart the left table from the right table (they’re both the same table).

Example: Comparing rows in the same table#

Let’s say we want to compare those customers whose first names are equal to other customers last names:

%%sql
SELECT
    a.customer_id, a.first_name, a.last_name, b.customer_id, b.first_name, b.last_name
FROM customer AS a
JOIN customer AS b
ON a.first_name = b.last_name
LIMIT 10;
 * postgresql://bob:***@localhost:5432/dvdrental
10 rows affected.
customer_id first_name last_name customer_id_1 first_name_1 last_name_1
65 Rose Howard 157 Darlene Rose
67 Kelly Torres 74 Denise Kelly
118 Kim Cruz 233 Lillie Kim
230 Joy George 307 Joseph Joy
253 Terry Carlson 265 Jennie Terry
299 James Gannon 71 Kathy James
301 Robert Baughman 331 Eric Robert
308 Thomas Grigsby 12 Nancy Thomas
312 Mark Rinehart 336 Joshua Mark
314 George Linton 230 Joy George

Instead of using the JOIN statment explicitely, we could write the query above as follows;

%%sql
SELECT
    a.customer_id, a.first_name, a.last_name, b.customer_id, b.first_name, b.last_name
FROM customer AS a, customer AS b
WHERE a.first_name = b.last_name
LIMIT 10;
 * postgresql://bob:***@localhost:5432/dvdrental
10 rows affected.
customer_id first_name last_name customer_id_1 first_name_1 last_name_1
65 Rose Howard 157 Darlene Rose
67 Kelly Torres 74 Denise Kelly
118 Kim Cruz 233 Lillie Kim
230 Joy George 307 Joseph Joy
253 Terry Carlson 265 Jennie Terry
299 James Gannon 71 Kathy James
301 Robert Baughman 331 Eric Robert
308 Thomas Grigsby 12 Nancy Thomas
312 Mark Rinehart 336 Joshua Mark
314 George Linton 230 Joy George

In the query above we specify the same table twice, separated by a comma (instead of a JOIN) and aliased to a and b. For this reason, instead of using ON to provide the condition, we do it using a WHERE clause.

Example 2: Querying Hyerarchical data using a Self Join#

A common SQL code interview example is the one where we’re presented with a single employee table, which employees and managers (they all are employees, but some manage others). Check the following diagram for clarification:

self join

The table containing the employees above would be:

id

first_name

last_name

manager_id

1

Windy

Hays

NULL

2

Ava

Christensen

1

3

Hassan

Conner

1

4

Anna

Reeves

2

5

Sau

Norman

2

6

Kelsie

Hays

3

7

Tory

Goff

3

8

Salley

Lester

3

As you can see, the Windy Hays row shows a manager_id with the value NULL, because she’s the top dog 🐶.

Creating a New Database and Table#

Let’s create a new database (let’s call it acme) for demonstrating this query:

createdb -h localhost -p 5432 -U bob acme

Then we have to create a connection to it:

import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://bob:1234@localhost:5432/acme')
engine.execution_options(isolation_level="AUTOCOMMIT")
%load_ext sql
%sql $engine.url
The sql extension is already loaded. To reload it, use:
  %reload_ext sql
'Connected: bob@acme'

Now let’s create the employee table, and insert some rows in it:

%%sql
CREATE TABLE employee (
    employee_id INT PRIMARY KEY,
    first_name VARCHAR (255) NOT NULL,
    last_name VARCHAR (255) NOT NULL,
    manager_id INT
);

INSERT INTO employee (
    employee_id,
    first_name,
    last_name,
    manager_id
)
VALUES
    (1, 'Windy', 'Hays', NULL),
    (2, 'Ava', 'Christensen', 1),
    (3, 'Hassan', 'Conner', 1),
    (4, 'Anna', 'Reeves', 2),
    (5, 'Sau', 'Norman', 2),
    (6, 'Kelsie', 'Hays', 3),
    (7, 'Tory', 'Goff', 3),
    (8, 'Salley', 'Lester', 3);
 * postgresql://bob:***@localhost:5432/acme
   postgresql://bob:***@localhost:5432/dvdrental
(psycopg2.errors.DuplicateTable) relation "employee" already exists

[SQL: CREATE TABLE employee (
    employee_id INT PRIMARY KEY,
    first_name VARCHAR (255) NOT NULL,
    last_name VARCHAR (255) NOT NULL,
    manager_id INT
);]
(Background on this error at: https://sqlalche.me/e/14/f405)

Let’s say we want to pull data which contains:

  • Employee full name

  • Full name of his/her manager

%%sql
SELECT
    e.first_name || ' ' || e.last_name AS employee,
    m.first_name || ' ' || m.last_name AS manager
FROM
    employee e
JOIN employee m ON m.employee_id = e.manager_id
ORDER BY manager;
 * postgresql://bob:***@localhost:5432/acme
   postgresql://bob:***@localhost:5432/dvdrental
7 rows affected.
employee manager
Sau Norman Ava Christensen
Anna Reeves Ava Christensen
Salley Lester Hassan Conner
Kelsie Hays Hassan Conner
Tory Goff Hassan Conner
Ava Christensen Windy Hays
Hassan Conner Windy Hays

As you can see, the employee column doesn’t contain the name of the head honcho, Windy Hays.