Joins Overview
Contents
Joins Overview#
So far, we’ve been using INNER JOIN
, which it’s the most common used one, but there are more. We’ll be using the old Jeff Atwood’s post about SQL joins as a base for our explanation.
Create a Test database#
Let’s create a database named test_joins
so we can add a couple of tables to explain a few more SQL joins. Open the terminal and run:
createdb -h localhost -p 5432 -U bob test_joins
Now let’s create a connection to this new database:
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://bob:1234@localhost:5432/test_joins')
engine.execution_options(isolation_level="AUTOCOMMIT")
%load_ext sql
%sql $engine.url
'Connected: bob@test_joins'
Create Tables#
Now let’s add a couple of tables to test_joins
. The first:
%%sql
CREATE TABLE "TableA" (
id serial PRIMARY KEY,
name VARCHAR ( 50 ) UNIQUE NOT NULL
);
* postgresql://bob:***@localhost:5432/test_joins
(psycopg2.errors.DuplicateTable) relation "TableA" already exists
[SQL: CREATE TABLE "TableA" (
id serial PRIMARY KEY,
name VARCHAR ( 50 ) UNIQUE NOT NULL
);]
(Background on this error at: https://sqlalche.me/e/14/f405)
Let’s check that TableA
has been created:
%%sql
SELECT * FROM "TableA" WHERE false;
* postgresql://bob:***@localhost:5432/test_joins
0 rows affected.
id | name |
---|
Now let’s insert some data into TableA
:
%%sql
INSERT INTO
"TableA" (name)
VALUES
('Pirate'),
('Monkey'),
('Ninja'),
('Spaguetti');
* postgresql://bob:***@localhost:5432/test_joins
---------------------------------------------------------------------------
UniqueViolation Traceback (most recent call last)
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1900, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw)
1899 if not evt_handled:
-> 1900 self.dialect.do_execute(
1901 cursor, statement, parameters, context
1902 )
1904 if self._has_events or self.engine._has_events:
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/default.py:736, in DefaultDialect.do_execute(self, cursor, statement, parameters, context)
735 def do_execute(self, cursor, statement, parameters, context=None):
--> 736 cursor.execute(statement, parameters)
UniqueViolation: duplicate key value violates unique constraint "TableA_name_key"
DETAIL: Key (name)=(Pirate) already exists.
The above exception was the direct cause of the following exception:
IntegrityError Traceback (most recent call last)
Cell In[4], line 1
----> 1 get_ipython().run_cell_magic('sql', '', 'INSERT INTO\n "TableA" (name)\nVALUES\n (\'Pirate\'),\n (\'Monkey\'),\n (\'Ninja\'),\n (\'Spaguetti\');\n')
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/IPython/core/interactiveshell.py:2430, in InteractiveShell.run_cell_magic(self, magic_name, line, cell)
2428 with self.builtin_trap:
2429 args = (magic_arg_s, cell)
-> 2430 result = fn(*args, **kwargs)
2432 # The code below prevents the output from being displayed
2433 # when using magics with decodator @output_can_be_silenced
2434 # when the last Python token in the expression is a ';'.
2435 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sql/magic.py:95, in SqlMagic.execute(self, line, cell, local_ns)
92 return self._persist_dataframe(parsed['sql'], conn, user_ns)
94 try:
---> 95 result = sql.run.run(conn, parsed['sql'], self, user_ns)
97 if result is not None and not isinstance(result, str) and self.column_local_vars:
98 #Instead of returning values, set variables directly in the
99 #users namespace. Variable names given by column names
101 if self.autopandas:
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sql/run.py:340, in run(conn, sql, config, user_namespace)
338 else:
339 txt = sqlalchemy.sql.text(statement)
--> 340 result = conn.session.execute(txt, user_namespace)
341 _commit(conn=conn, config=config)
342 if result and config.feedback:
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1380, in Connection.execute(self, statement, *multiparams, **params)
1376 util.raise_(
1377 exc.ObjectNotExecutableError(statement), replace_context=err
1378 )
1379 else:
-> 1380 return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/sql/elements.py:334, in ClauseElement._execute_on_connection(self, connection, multiparams, params, execution_options, _force)
330 def _execute_on_connection(
331 self, connection, multiparams, params, execution_options, _force=False
332 ):
333 if _force or self.supports_execution:
--> 334 return connection._execute_clauseelement(
335 self, multiparams, params, execution_options
336 )
337 else:
338 raise exc.ObjectNotExecutableError(self)
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1572, in Connection._execute_clauseelement(self, elem, multiparams, params, execution_options)
1560 compiled_cache = execution_options.get(
1561 "compiled_cache", self.engine._compiled_cache
1562 )
1564 compiled_sql, extracted_params, cache_hit = elem._compile_w_cache(
1565 dialect=dialect,
1566 compiled_cache=compiled_cache,
(...)
1570 linting=self.dialect.compiler_linting | compiler.WARN_LINTING,
1571 )
-> 1572 ret = self._execute_context(
1573 dialect,
1574 dialect.execution_ctx_cls._init_compiled,
1575 compiled_sql,
1576 distilled_params,
1577 execution_options,
1578 compiled_sql,
1579 distilled_params,
1580 elem,
1581 extracted_params,
1582 cache_hit=cache_hit,
1583 )
1584 if has_events:
1585 self.dispatch.after_execute(
1586 self,
1587 elem,
(...)
1591 ret,
1592 )
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1943, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw)
1940 branched.close()
1942 except BaseException as e:
-> 1943 self._handle_dbapi_exception(
1944 e, statement, parameters, cursor, context
1945 )
1947 return result
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/base.py:2124, in Connection._handle_dbapi_exception(self, e, statement, parameters, cursor, context)
2122 util.raise_(newraise, with_traceback=exc_info[2], from_=e)
2123 elif should_wrap:
-> 2124 util.raise_(
2125 sqlalchemy_exception, with_traceback=exc_info[2], from_=e
2126 )
2127 else:
2128 util.raise_(exc_info[1], with_traceback=exc_info[2])
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/util/compat.py:211, in raise_(***failed resolving arguments***)
208 exception.__cause__ = replace_context
210 try:
--> 211 raise exception
212 finally:
213 # credit to
214 # https://cosmicpercolator.com/2016/01/13/exception-leaks-in-python-2-and-3/
215 # as the __traceback__ object creates a cycle
216 del exception, replace_context, from_, with_traceback
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/base.py:1900, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw)
1898 break
1899 if not evt_handled:
-> 1900 self.dialect.do_execute(
1901 cursor, statement, parameters, context
1902 )
1904 if self._has_events or self.engine._has_events:
1905 self.dispatch.after_cursor_execute(
1906 self,
1907 cursor,
(...)
1911 context.executemany,
1912 )
File ~/.conda/envs/sql-lab/lib/python3.10/site-packages/sqlalchemy/engine/default.py:736, in DefaultDialect.do_execute(self, cursor, statement, parameters, context)
735 def do_execute(self, cursor, statement, parameters, context=None):
--> 736 cursor.execute(statement, parameters)
IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "TableA_name_key"
DETAIL: Key (name)=(Pirate) already exists.
[SQL: INSERT INTO
"TableA" (name)
VALUES
('Pirate'),
('Monkey'),
('Ninja'),
('Spaguetti');]
(Background on this error at: https://sqlalche.me/e/14/gkpj)
Let’s check the data has been properly inserted:
%%sql
SELECT * FROM "TableA";
* postgresql://bob:***@localhost:5432/test_joins
4 rows affected.
id | name |
---|---|
1 | Pirate |
2 | Monkey |
3 | Ninja |
4 | Spaguetti |
Great, let’s do something similar for TableB
:
%%sql
CREATE TABLE "TableB" (
id serial PRIMARY KEY,
name VARCHAR ( 50 ) UNIQUE NOT NULL
);
INSERT INTO
"TableB" (name)
VALUES
('Rutabaga'),
('Pirate'),
('Darth Vader'),
('Ninja')
RETURNING *;
* postgresql://bob:***@localhost:5432/test_joins
Done.
4 rows affected.
id | name |
---|---|
1 | Rutabaga |
2 | Pirate |
3 | Darth Vader |
4 | Ninja |
INNER JOIN#
Let’s refresh our knowledge doing an INNER JOIN on the name
column:
%%sql
SELECT * FROM "TableA"
INNER JOIN "TableB"
ON "TableA".name = "TableB".name;
* postgresql://bob:***@localhost:5432/test_joins
2 rows affected.
id | name | id_1 | name_1 |
---|---|---|---|
1 | Pirate | 2 | Pirate |
3 | Ninja | 4 | Ninja |
As you can see, INNER JOIN
produces only the set of records that match in both TableA
and TableB
, meaning those rows where the column name
have the same value.
By the way, since the
INNER JOIN
statement is the default type of join, sometimes you’ll see it written just asJOIN
.
Quite often, Venn diagrams are used to explain SQL joins. The following one is supposed to describe the inner join:
FULL OUTER JOIN#
A FULL OUTER JOIN produces the set of all rows from both TableA
and Table B
where:
Rows with matching values are paired on the column we’re using for the join.
Null for those rows where the column used for the join has no match in the joined table.
Let’s see an example:
%%sql
SELECT * FROM "TableA"
FULL OUTER JOIN "TableB"
ON "TableA".name = "TableB".name;
* postgresql://bob:***@localhost:5432/test_joins
6 rows affected.
id | name | id_1 | name_1 |
---|---|---|---|
1 | Pirate | 2 | Pirate |
2 | Monkey | None | None |
3 | Ninja | 4 | Ninja |
4 | Spaguetti | None | None |
None | None | 3 | Darth Vader |
None | None | 1 | Rutabaga |
For this type of join, the following diagram is used:
Excluding all rows with common values#
What if we’re interested in those rows which have no corresponding values on the column we’re using to do the join? Check this out:
%%sql
SELECT * FROM "TableA"
FULL OUTER JOIN "TableB"
ON "TableA".name = "TableB".name
WHERE "TableA".id IS null
OR "TableB".id IS null;
* postgresql://bob:***@localhost:5432/test_joins
4 rows affected.
id | name | id_1 | name_1 |
---|---|---|---|
2 | Monkey | None | None |
4 | Spaguetti | None | None |
None | None | 3 | Darth Vader |
None | None | 1 | Rutabaga |
As you can see, we’re excluding all rows with matching values on name
. For this type of join, the following diagram is used:
LEFT OUTER JOIN#
In a LEFT OUTER JOIN all rows from TableA
are retrieved, and only the rows of Table B
with matching values on the column we’re using for the join. All the rows of TableA
which have no match will be paired with a null
on the right. For example:
%%sql
SELECT * FROM "TableA"
LEFT OUTER JOIN "TableB"
ON "TableA".name = "TableB".name;
* postgresql://bob:***@localhost:5432/test_joins
4 rows affected.
id | name | id_1 | name_1 |
---|---|---|---|
1 | Pirate | 2 | Pirate |
2 | Monkey | None | None |
3 | Ninja | 4 | Ninja |
4 | Spaguetti | None | None |
For this type of join, the following diagram is used:
Excluding all rows from the left#
What if we’re interested in those records from TableA
that have no corresponding record in TableB
? Check this out:
%%sql
SELECT * FROM "TableA"
LEFT OUTER JOIN "TableB"
ON "TableA".name = "TableB".name
WHERE "TableB".name IS null;
* postgresql://bob:***@localhost:5432/test_joins
2 rows affected.
id | name | id_1 | name_1 |
---|---|---|---|
2 | Monkey | None | None |
4 | Spaguetti | None | None |
As you can see, thanks to the use of the WHERE
clause, we’ve achieved two things:
Get the rows of
TableA
with no match inTableB
.Exclude all results from the table to the right (in this case
TableB
).
For this type of join, the following diagram is used: