The GROUP BY clause
Contents
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://bob:1234@localhost:5432/dvdrental')
engine.execution_options(isolation_level="AUTOCOMMIT")
%load_ext sql
%sql $engine.url
'Connected: bob@dvdrental'
The GROUP BY clause#
The GROUP BY clause is used to group together the rows in returned by the SELECT
statement into groups. That’s useful because we can apply to each group an aggregate function, such as calculate the sum
of items in a group, and count
the number of items in another group. The basic syntax looks like this:
SELECT column_1, aggregate_function(column_2)
FROM weather_table
GROUP BY column_1;
To understand how this works, let’s take a peek into the payment
table:
%%sql
SELECT *
FROM payment
ORDER BY customer_id ASC
LIMIT 5;
* postgresql://bob:***@localhost:5432/dvdrental
5 rows affected.
payment_id | customer_id | staff_id | rental_id | amount | payment_date |
---|---|---|---|---|---|
18495 | 1 | 1 | 1185 | 5.99 | 2007-02-14 23:22:38.996577 |
18496 | 1 | 2 | 1422 | 0.99 | 2007-02-15 16:31:19.996577 |
18497 | 1 | 2 | 1476 | 9.99 | 2007-02-15 19:37:12.996577 |
18498 | 1 | 1 | 1725 | 4.99 | 2007-02-16 13:47:23.996577 |
18499 | 1 | 1 | 2308 | 4.99 | 2007-02-18 07:10:14.996577 |
As you can see, the customer with id 1
has made several payments of different amount. What if we wanted to find out the average amount paid by the first 10 customers:
%%sql
SELECT customer_id, avg(amount) AS average_amount
FROM payment
GROUP BY customer_id
LIMIT 10;
* postgresql://bob:***@localhost:5432/dvdrental
10 rows affected.
customer_id | average_amount |
---|---|
1 | 3.8233333333333333 |
2 | 4.7592307692307692 |
3 | 5.4483333333333333 |
4 | 3.7172727272727273 |
5 | 3.8471428571428571 |
6 | 3.3900000000000000 |
7 | 4.6685714285714286 |
8 | 3.7291304347826087 |
9 | 3.9400000000000000 |
10 | 3.9483333333333333 |
As you can see, we’ve use the AS
clause to set up a column alias for the value return by the aggregate function.
Using GROUP BY without an aggregate function#
Even though most of the times we’ll be using GROUP BY with an aggregate function, let’s take a look at how to use it without one, to better understand what it does. As we saw before, each customer has several payments associated with her id. For example, let’s count the amount of rows in the payment
table:
%%sql
SELECT count(customer_id)
FROM payment;
* postgresql://bob:***@localhost:5432/dvdrental
1 rows affected.
count |
---|
14596 |
Now, let’s group the payments by customer_id
:
%%sql
SELECT count(customer_id) AS payments_by_customer
FROM payment
GROUP BY customer_id
LIMIT 10;
* postgresql://bob:***@localhost:5432/dvdrental
10 rows affected.
payments_by_customer |
---|
30 |
26 |
24 |
22 |
35 |
25 |
28 |
23 |
20 |
24 |
We had to limit the amount of rows to 10, but basically what the query above is doing, it’s grouping all the payments by customer id. Without using the LIMIT
we’d get 599 rows (compare that number with the 14596 that we got at the beginning), with represent the amount of customers (the number represents how many payments each customer has done).
More examples#
The best way to understand how GROUP BY
works, it’s by practicing with examples.
Let’s say we want to find out our 10 best customers, meaning the ones that have spent the most:
%%sql
SELECT customer_id, sum(amount) AS moneys
FROM payment
GROUP BY customer_id
ORDER BY sum(amount) DESC
LIMIT 10;
* postgresql://bob:***@localhost:5432/dvdrental
10 rows affected.
customer_id | moneys |
---|---|
148 | 211.55 |
526 | 208.58 |
178 | 194.61 |
137 | 191.62 |
144 | 189.60 |
459 | 183.63 |
181 | 167.67 |
410 | 167.62 |
236 | 166.61 |
403 | 162.67 |
Now let’s say we want to find out who’s our best employee, meaning the one who’s processed the most amount of orders:
%%sql
SELECT staff_id, count(*)
FROM payment
GROUP BY staff_id;
* postgresql://bob:***@localhost:5432/dvdrental
2 rows affected.
staff_id | count |
---|---|
1 | 7292 |
2 | 7304 |
What if we’re curious about how many movies of each rating exists in the
film
table:
%%sql
SELECT rating, count(*) AS amount_of_films
FROM film
GROUP BY rating;
* postgresql://bob:***@localhost:5432/dvdrental
5 rows affected.
rating | amount_of_films |
---|---|
G | 178 |
PG | 194 |
PG-13 | 223 |
R | 195 |
NC-17 | 210 |
How many payments did each employee handle, and how much was the total amount processed by each of them?
%%sql
SELECT staff_id, count(*) AS payments, sum(amount) AS total
FROM payment
GROUP BY staff_id
ORDER BY total DESC;
* postgresql://bob:***@localhost:5432/dvdrental
2 rows affected.
staff_id | payments | total |
---|---|---|
2 | 7304 | 31059.92 |
1 | 7292 | 30252.12 |
Find out the average replacement cost of movies by rating. For example, R rated movies have an average replacement cost of $20.23.
%%sql
SELECT rating, round(avg(replacement_cost), 2) AS avg_replacement_cost
FROM film
GROUP BY rating;
* postgresql://bob:***@localhost:5432/dvdrental
5 rows affected.
rating | avg_replacement_cost |
---|---|
G | 20.12 |
PG | 18.96 |
PG-13 | 20.40 |
R | 20.23 |
NC-17 | 20.14 |
Find the top 5 customers, meaning the ones that have spent more moneys in our shop.
%%sql
SELECT customer_id, sum(amount) AS moneys
FROM payment
GROUP BY customer_id
ORDER BY moneys DESC
LIMIT 5;
* postgresql://bob:***@localhost:5432/dvdrental
5 rows affected.
customer_id | moneys |
---|---|
148 | 211.55 |
526 | 208.58 |
178 | 194.61 |
137 | 191.62 |
144 | 189.60 |