Joins Overview#

So far, we’ve been using INNER JOIN, which it’s the most common used one, but there are more. We’ll be using the old Jeff Atwood’s post about SQL joins as a base for our explanation.

Create a Test database#

Let’s create a database named test_joins so we can add a couple of tables to explain a few more SQL joins. Open the terminal and run:

createdb -h localhost -p 5432 -U bob test_joins

Now let’s create a connection to this new database:

import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://bob:1234@localhost:5432/test_joins')
engine.execution_options(isolation_level="AUTOCOMMIT")
%load_ext sql
%sql $engine.url
'Connected: bob@test_joins'

Create Tables#

Now let’s add a couple of tables to test_joins. The first:

%%sql
CREATE TABLE "TableA" (
    id serial PRIMARY KEY,
	name VARCHAR ( 50 ) UNIQUE NOT NULL
);
 * postgresql://bob:***@localhost:5432/test_joins
(psycopg2.errors.DuplicateTable) relation "TableA" already exists

[SQL: CREATE TABLE "TableA" (
    id serial PRIMARY KEY,
	name VARCHAR ( 50 ) UNIQUE NOT NULL
);]
(Background on this error at: https://sqlalche.me/e/14/f405)

Let’s check that TableA has been created:

%%sql
SELECT * FROM "TableA" WHERE false;
 * postgresql://bob:***@localhost:5432/test_joins
0 rows affected.
id name

Now let’s insert some data into TableA:

%%sql
INSERT INTO
    "TableA" (name)
VALUES
    ('Pirate'),
    ('Monkey'),
    ('Ninja'),
    ('Spaguetti');
 * postgresql://bob:***@localhost:5432/test_joins
---------------------------------------------------------------------------
UniqueViolation                           Traceback (most recent call last)
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1900, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw)
   1899     if not evt_handled:
-> 1900         self.dialect.do_execute(
   1901             cursor, statement, parameters, context
   1902         )
   1904 if self._has_events or self.engine._has_events:

File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/default.py:736, in DefaultDialect.do_execute(self, cursor, statement, parameters, context)
    735 def do_execute(self, cursor, statement, parameters, context=None):
--> 736     cursor.execute(statement, parameters)

UniqueViolation: duplicate key value violates unique constraint "TableA_name_key"
DETAIL:  Key (name)=(Pirate) already exists.


The above exception was the direct cause of the following exception:

IntegrityError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 get_ipython().run_cell_magic('sql', '', 'INSERT INTO\n    "TableA" (name)\nVALUES\n    (\'Pirate\'),\n    (\'Monkey\'),\n    (\'Ninja\'),\n    (\'Spaguetti\');\n')

File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/IPython/core/interactiveshell.py:2430, in InteractiveShell.run_cell_magic(self, magic_name, line, cell)
   2428 with self.builtin_trap:
   2429     args = (magic_arg_s, cell)
-> 2430     result = fn(*args, **kwargs)
   2432 # The code below prevents the output from being displayed
   2433 # when using magics with decodator @output_can_be_silenced
   2434 # when the last Python token in the expression is a ';'.
   2435 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):

File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sql/magic.py:95, in SqlMagic.execute(self, line, cell, local_ns)
     92     return self._persist_dataframe(parsed['sql'], conn, user_ns)
     94 try:
---> 95     result = sql.run.run(conn, parsed['sql'], self, user_ns)
     97     if result is not None and not isinstance(result, str) and self.column_local_vars:
     98         #Instead of returning values, set variables directly in the
     99         #users namespace. Variable names given by column names
    101         if self.autopandas:

File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sql/run.py:340, in run(conn, sql, config, user_namespace)
    338 else:
    339     txt = sqlalchemy.sql.text(statement)
--> 340     result = conn.session.execute(txt, user_namespace)
    341 _commit(conn=conn, config=config)
    342 if result and config.feedback:

File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1380, in Connection.execute(self, statement, *multiparams, **params)
   1376     util.raise_(
   1377         exc.ObjectNotExecutableError(statement), replace_context=err
   1378     )
   1379 else:
-> 1380     return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)

File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/sql/elements.py:334, in ClauseElement._execute_on_connection(self, connection, multiparams, params, execution_options, _force)
    330 def _execute_on_connection(
    331     self, connection, multiparams, params, execution_options, _force=False
    332 ):
    333     if _force or self.supports_execution:
--> 334         return connection._execute_clauseelement(
    335             self, multiparams, params, execution_options
    336         )
    337     else:
    338         raise exc.ObjectNotExecutableError(self)

File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1572, in Connection._execute_clauseelement(self, elem, multiparams, params, execution_options)
   1560 compiled_cache = execution_options.get(
   1561     "compiled_cache", self.engine._compiled_cache
   1562 )
   1564 compiled_sql, extracted_params, cache_hit = elem._compile_w_cache(
   1565     dialect=dialect,
   1566     compiled_cache=compiled_cache,
   (...)
   1570     linting=self.dialect.compiler_linting | compiler.WARN_LINTING,
   1571 )
-> 1572 ret = self._execute_context(
   1573     dialect,
   1574     dialect.execution_ctx_cls._init_compiled,
   1575     compiled_sql,
   1576     distilled_params,
   1577     execution_options,
   1578     compiled_sql,
   1579     distilled_params,
   1580     elem,
   1581     extracted_params,
   1582     cache_hit=cache_hit,
   1583 )
   1584 if has_events:
   1585     self.dispatch.after_execute(
   1586         self,
   1587         elem,
   (...)
   1591         ret,
   1592     )

File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1943, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw)
   1940             branched.close()
   1942 except BaseException as e:
-> 1943     self._handle_dbapi_exception(
   1944         e, statement, parameters, cursor, context
   1945     )
   1947 return result

File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/base.py:2124, in Connection._handle_dbapi_exception(self, e, statement, parameters, cursor, context)
   2122     util.raise_(newraise, with_traceback=exc_info[2], from_=e)
   2123 elif should_wrap:
-> 2124     util.raise_(
   2125         sqlalchemy_exception, with_traceback=exc_info[2], from_=e
   2126     )
   2127 else:
   2128     util.raise_(exc_info[1], with_traceback=exc_info[2])

File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/util/compat.py:211, in raise_(***failed resolving arguments***)
    208     exception.__cause__ = replace_context
    210 try:
--> 211     raise exception
    212 finally:
    213     # credit to
    214     # https://cosmicpercolator.com/2016/01/13/exception-leaks-in-python-2-and-3/
    215     # as the __traceback__ object creates a cycle
    216     del exception, replace_context, from_, with_traceback

File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1900, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw)
   1898                 break
   1899     if not evt_handled:
-> 1900         self.dialect.do_execute(
   1901             cursor, statement, parameters, context
   1902         )
   1904 if self._has_events or self.engine._has_events:
   1905     self.dispatch.after_cursor_execute(
   1906         self,
   1907         cursor,
   (...)
   1911         context.executemany,
   1912     )

File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/default.py:736, in DefaultDialect.do_execute(self, cursor, statement, parameters, context)
    735 def do_execute(self, cursor, statement, parameters, context=None):
--> 736     cursor.execute(statement, parameters)

IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "TableA_name_key"
DETAIL:  Key (name)=(Pirate) already exists.

[SQL: INSERT INTO
    "TableA" (name)
VALUES
    ('Pirate'),
    ('Monkey'),
    ('Ninja'),
    ('Spaguetti');]
(Background on this error at: https://sqlalche.me/e/14/gkpj)

Let’s check the data has been properly inserted:

%%sql
SELECT * FROM "TableA";
 * postgresql://bob:***@localhost:5432/test_joins
4 rows affected.
id name
1 Pirate
2 Monkey
3 Ninja
4 Spaguetti

Great, let’s do something similar for TableB:

%%sql
CREATE TABLE "TableB" (
    id serial PRIMARY KEY,
	name VARCHAR ( 50 ) UNIQUE NOT NULL
);

INSERT INTO
    "TableB" (name)
VALUES
    ('Rutabaga'),
    ('Pirate'),
    ('Darth Vader'),
    ('Ninja')
RETURNING *;
 * postgresql://bob:***@localhost:5432/test_joins
Done.
4 rows affected.
id name
1 Rutabaga
2 Pirate
3 Darth Vader
4 Ninja

INNER JOIN#

Let’s refresh our knowledge doing an INNER JOIN on the name column:

%%sql
SELECT * FROM "TableA"
INNER JOIN "TableB"
ON "TableA".name = "TableB".name;
 * postgresql://bob:***@localhost:5432/test_joins
2 rows affected.
id name id_1 name_1
1 Pirate 2 Pirate
3 Ninja 4 Ninja

As you can see, INNER JOIN produces only the set of records that match in both TableA and TableB, meaning those rows where the column name have the same value.

By the way, since the INNER JOIN statement is the default type of join, sometimes you’ll see it written just as JOIN.

Quite often, Venn diagrams are used to explain SQL joins. The following one is supposed to describe the inner join:

inner join

FULL OUTER JOIN#

A FULL OUTER JOIN produces the set of all rows from both TableA and Table B where:

  • Rows with matching values are paired on the column we’re using for the join.

  • Null for those rows where the column used for the join has no match in the joined table.

Let’s see an example:

%%sql
SELECT * FROM "TableA"
FULL OUTER JOIN "TableB"
ON "TableA".name = "TableB".name;
 * postgresql://bob:***@localhost:5432/test_joins
6 rows affected.
id name id_1 name_1
1 Pirate 2 Pirate
2 Monkey None None
3 Ninja 4 Ninja
4 Spaguetti None None
None None 3 Darth Vader
None None 1 Rutabaga

Note that we’re getting None instead of null, because we’re using Alchemy to connect to the database.

For this type of join, the following diagram is used:

full outer join

Excluding all rows with common values#

What if we’re interested in those rows which have no corresponding values on the column we’re using to do the join? Check this out:

%%sql
SELECT * FROM "TableA"
FULL OUTER JOIN "TableB"
ON "TableA".name = "TableB".name
WHERE "TableA".id IS null
OR "TableB".id IS null;
 * postgresql://bob:***@localhost:5432/test_joins
4 rows affected.
id name id_1 name_1
2 Monkey None None
4 Spaguetti None None
None None 3 Darth Vader
None None 1 Rutabaga

As you can see, we’re excluding all rows with matching values on name. For this type of join, the following diagram is used:

full outer_where join

LEFT OUTER JOIN#

In a LEFT OUTER JOIN all rows from TableA are retrieved, and only the rows of Table B with matching values on the column we’re using for the join. All the rows of TableA which have no match will be paired with a null on the right. For example:

%%sql
SELECT * FROM "TableA"
LEFT OUTER JOIN "TableB"
ON "TableA".name = "TableB".name;
 * postgresql://bob:***@localhost:5432/test_joins
4 rows affected.
id name id_1 name_1
1 Pirate 2 Pirate
2 Monkey None None
3 Ninja 4 Ninja
4 Spaguetti None None

For this type of join, the following diagram is used:

left outer join

Excluding all rows from the left#

What if we’re interested in those records from TableA that have no corresponding record in TableB? Check this out:

%%sql
SELECT * FROM "TableA"
LEFT OUTER JOIN "TableB"
ON "TableA".name = "TableB".name
WHERE "TableB".name IS null;
 * postgresql://bob:***@localhost:5432/test_joins
2 rows affected.
id name id_1 name_1
2 Monkey None None
4 Spaguetti None None

As you can see, thanks to the use of the WHERE clause, we’ve achieved two things:

  • Get the rows of TableA with no match in TableB.

  • Exclude all results from the table to the right (in this case TableB).

For this type of join, the following diagram is used:

left outer_where join