import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://bob:1234@localhost:5432/dvdrental')
engine.execution_options(isolation_level="AUTOCOMMIT")
%load_ext sql
%sql $engine.url
'Connected: bob@dvdrental'

The GROUP BY clause#

The GROUP BY clause is used to group together the rows in returned by the SELECT statement into groups. That’s useful because we can apply to each group an aggregate function, such as calculate the sum of items in a group, and count the number of items in another group. The basic syntax looks like this:

SELECT column_1, aggregate_function(column_2)
FROM weather_table
GROUP BY column_1;

To understand how this works, let’s take a peek into the payment table:

%%sql
SELECT *
FROM payment
ORDER BY customer_id ASC
LIMIT 5;
 * postgresql://bob:***@localhost:5432/dvdrental
5 rows affected.
payment_id customer_id staff_id rental_id amount payment_date
18495 1 1 1185 5.99 2007-02-14 23:22:38.996577
18496 1 2 1422 0.99 2007-02-15 16:31:19.996577
18497 1 2 1476 9.99 2007-02-15 19:37:12.996577
18498 1 1 1725 4.99 2007-02-16 13:47:23.996577
18499 1 1 2308 4.99 2007-02-18 07:10:14.996577

As you can see, the customer with id 1 has made several payments of different amount. What if we wanted to find out the average amount paid by the first 10 customers:

%%sql
SELECT customer_id, avg(amount) AS average_amount
FROM payment
GROUP BY customer_id
LIMIT 10;
 * postgresql://bob:***@localhost:5432/dvdrental
10 rows affected.
customer_id average_amount
1 3.8233333333333333
2 4.7592307692307692
3 5.4483333333333333
4 3.7172727272727273
5 3.8471428571428571
6 3.3900000000000000
7 4.6685714285714286
8 3.7291304347826087
9 3.9400000000000000
10 3.9483333333333333

As you can see, we’ve use the AS clause to set up a column alias for the value return by the aggregate function.

Using GROUP BY without an aggregate function#

Even though most of the times we’ll be using GROUP BY with an aggregate function, let’s take a look at how to use it without one, to better understand what it does. As we saw before, each customer has several payments associated with her id. For example, let’s count the amount of rows in the payment table:

%%sql
SELECT count(customer_id)
FROM payment;
 * postgresql://bob:***@localhost:5432/dvdrental
1 rows affected.
count
14596

Now, let’s group the payments by customer_id:

%%sql
SELECT count(customer_id) AS payments_by_customer
FROM payment
GROUP BY customer_id
LIMIT 10;
 * postgresql://bob:***@localhost:5432/dvdrental
10 rows affected.
payments_by_customer
30
26
24
22
35
25
28
23
20
24

We had to limit the amount of rows to 10, but basically what the query above is doing, it’s grouping all the payments by customer id. Without using the LIMIT we’d get 599 rows (compare that number with the 14596 that we got at the beginning), with represent the amount of customers (the number represents how many payments each customer has done).

More examples#

The best way to understand how GROUP BY works, it’s by practicing with examples.

  1. Let’s say we want to find out our 10 best customers, meaning the ones that have spent the most:

%%sql
SELECT customer_id, sum(amount) AS moneys
FROM payment
GROUP BY customer_id
ORDER BY sum(amount) DESC 
LIMIT 10;
 * postgresql://bob:***@localhost:5432/dvdrental
10 rows affected.
customer_id moneys
148 211.55
526 208.58
178 194.61
137 191.62
144 189.60
459 183.63
181 167.67
410 167.62
236 166.61
403 162.67
  1. Now let’s say we want to find out who’s our best employee, meaning the one who’s processed the most amount of orders:

%%sql
SELECT staff_id, count(*)
FROM payment
GROUP BY staff_id;
 * postgresql://bob:***@localhost:5432/dvdrental
2 rows affected.
staff_id count
1 7292
2 7304
  1. What if we’re curious about how many movies of each rating exists in the film table:

%%sql
SELECT rating, count(*) AS amount_of_films
FROM film
GROUP BY rating;
 * postgresql://bob:***@localhost:5432/dvdrental
5 rows affected.
rating amount_of_films
G 178
PG 194
PG-13 223
R 195
NC-17 210
  1. How many payments did each employee handle, and how much was the total amount processed by each of them?

%%sql
SELECT staff_id, count(*) AS payments, sum(amount) AS total
FROM payment
GROUP BY staff_id
ORDER BY total DESC;
 * postgresql://bob:***@localhost:5432/dvdrental
2 rows affected.
staff_id payments total
2 7304 31059.92
1 7292 30252.12
  1. Find out the average replacement cost of movies by rating. For example, R rated movies have an average replacement cost of $20.23.

%%sql
SELECT rating, round(avg(replacement_cost), 2) AS avg_replacement_cost
FROM film
GROUP BY rating;
 * postgresql://bob:***@localhost:5432/dvdrental
5 rows affected.
rating avg_replacement_cost
G 20.12
PG 18.96
PG-13 20.40
R 20.23
NC-17 20.14
  1. Find the top 5 customers, meaning the ones that have spent more moneys in our shop.

%%sql
SELECT customer_id, sum(amount) AS moneys
FROM payment
GROUP BY customer_id
ORDER BY moneys DESC
LIMIT 5;
 * postgresql://bob:***@localhost:5432/dvdrental
5 rows affected.
customer_id moneys
148 211.55
526 208.58
178 194.61
137 191.62
144 189.60