Understanding PostgreSQL crosstab

This content originally appeared on DEV Community and was authored by Marat Latypov

Usually piece of data in SQL is represented as a row in a table. Often it's convenient to represent it as a cell in pivot table. Crosstab could help us in it.

Let's see how it works.

Preparing data

First of all let's prepare data. Here I created tables a and b, and so-called join table c, which is related to a and b.

create sequence myseq start 1;
create table a
(
    id   bigint default nextval('myseq'::regclass) not null primary key,
    name text NOT NULL
);

create table b
(
    id   bigint default nextval('myseq'::regclass) not null primary key,
    name text NOT NULL
);

create table c
(
    a_id bigint not null references a(id),
    b_id bigint not null references b(id),
    value text
);

Now let's seed tables with data.

insert into a (name) values ('a1'), ('a2'), ('a3');
-- Result:
-- +--+----+
-- |id|name|
-- +--+----+
-- |1 |a1  |
-- |2 |a2  |
-- |3 |a3  |
-- +--+----+

insert into b (name) values ('b1'), ('b2'), ('b3');
-- Result:
-- +--+----+
-- |id|name|
-- +--+----+
-- |4 |b1  |
-- |5 |b2  |
-- |6 |b3  |
-- +--+----+

I used one shared increment sequence for both tables, which is why the id in table b starts from 4 instead of 1.

Just keep this in mind.

Next let's populate the join table c.

insert into c (a_id, b_id, value) values
(1, 4, 'v14'),
(1, 5, 'v15'),
(1, 6, 'v16'),

(2, 4, 'v24'),
(2, 5, 'v25'),
(2, 6, 'v26'),

(3, 4, 'v34'),
(3, 5, 'v35'),
(3, 6, 'v36');

Case #0 Cartesian product

Let's make a simple inner join.

select 
    a.name as a_name, 
    b.name as b_name, 
    c.value
from c
inner join a on a.id = c.a_id
inner join b on b.id = c.b_id
order by 1, 2;
-- Result:
-- +------+------+-----+
-- |a_name|b_name|value|
-- +------+------+-----+
-- |a1    |b1    |v14  |
-- |a1    |b2    |v15  |
-- |a1    |b3    |v16  |
-- |a2    |b1    |v24  |
-- |a2    |b2    |v25  |
-- |a2    |b3    |v26  |
-- |a3    |b1    |v34  |
-- |a3    |b2    |v35  |
-- |a3    |b3    |v36  |
-- +------+------+-----+

In the output result we see all the data in rows. Each row is a piece of information.

There is nothing special here

Case #1 Our pivot table using `crosstab`

Now let's make a pivot table from it

select *
from crosstab(
    'select
        a.name as a_name,
        b.name as b_name,
        c.value
    from c
    inner join a on a.id = c.a_id
    inner join b on b.id = c.b_id
    order by 1',
   'select name from b order by 1'
) as crosstab_alias (a_name text, b1 text, b2 text, b3 text);

-- Result:
-- +------+---+---+---+
-- |a_name|b1 |b2 |b3 |
-- +------+---+---+---+
-- |a1    |v14|v15|v16|
-- |a2    |v24|v25|v26|
-- |a3    |v34|v35|v36|
-- +------+---+---+---+

Wow! This result looks much better!

Let's look at the query

crosstab(text source_sql, text category_sql)

Crosstab function produces table and have 2 parameters : source_sql and category_sql.

source_sql - SQL used to retrieve the source data
category_sql - SQL used to retrieve categories

Source data has some restrictions:

The first column must be the row identifier (in our example a_name). It should be unique for each row in pivot table.
After row identifier column could be some extra fields. In our example we don't have them.
The next mandatory column is the category identifier (in our example b_name). It should be unique for each column in pivot table.
The last mandatory column is the value (in our example value). It can be any data type.

One more important thing is table alias and column aliases in our example it's the line

as crosstab_alias (a_name text, b1 text, b2 text, b3 text)

The alias crosstab_alias is quite common thing in SELECT queries. It's nothing special — aliases are often assigned to tables in queries for easier reference later on.

But in our case it's vital, because table produced by crosstab should have some name!

The next mandatory thing is column description with types. So it should contain:

row identifier (for ex. a_name text)
optional extra fields, if you have some
categories columns (for ex. b1 text, b2 text, b3 text)

That's it!

Case #2 Not existent data in pivot table

Now let's play with the data. What if we don't have full table?

delete from c; --delete previous data
insert into c (a_id, b_id, value) values 
   (1, 4, 'v14'),
-- (1, 5, 'v15'), removed line
   (1, 6, 'v16'),
   (2, 4, 'v24'),
   (2, 5, 'v25'),
-- (2, 6, 'v26'), removed line
-- (3, 4, 'v34'), removed line
   (3, 5, 'v35'),
   (3, 6, 'v36');

When we run our previous crosstab query on such data, we will get:

-- Result:
-- +------+----+----+----+
-- |a_name|b1  |b2  |b3  |
-- +------+----+----+----+
-- |a1    |v14 |null|v16 |
-- |a2    |v24 |v25 |null|
-- |a3    |null|v35 |v36 |
-- +------+----+----+----+

So we see now empty slots with no data are replaced with nulls!

Looks great for me!

Case #3 Duplicated data in pivot table

delete from c; --delete previous data if you need
insert into c (a_id, b_id, value) values
       (1, 4, 'v14'),
       (1, 5, 'v15'),
       (1, 6, 'v16'),
       (1, 6, 'v16 duplicate'),
       (2, 4, 'v24'),
       (2, 5, 'v25'),
       (2, 6, 'v26'),
       (3, 4, 'v34'),
       (3, 5, 'v35'),
       (3, 6, 'v36');

What does mean this duplicated data entry in our case, it looks like a nonsense.

But assume we have something like that and let's see what will happen.

Our crosstab query will return result:

-- Result:
-- +------+---+---+-------------+
-- |a_name|b1 |b2 |b3           |
-- +------+---+---+-------------+
-- |a1    |v14|v15|v16 duplicate|
-- |a2    |v24|v25|v26          |
-- |a3    |v34|v35|v36          |
-- +------+---+---+-------------+

Ouch! From two data entries for relation a1-b3 only the latest is shown. Why the latest? No idea, I don't know, it's on PostgreSQL implementation. Maybe it uses the latest value, maybe chooses randomly.
Anyway it sounds bad.

You shouldn't have any duplicated entries in you table. Maybe entries should be aggregated somehow.

Preparing data

Case #0 Cartesian product

Case #1 Our pivot table using crosstab

Case #2 Not existent data in pivot table

Case #3 Duplicated data in pivot table

Links

Related Posts

Case #1 Our pivot table using `crosstab`