Atomic UPSERT with unique constraint on null-able column in PostgreSQL



This post will be probably be very boring for everyone who doesn't have this problem. But since it cost me some time to figure out a solution, I think it is worth sharing.

Problem Statement

Let's consider a PostgreSQL table called foo with an ID column and a unique null-able column called bar:
CREATE TABLE foo (
    id serial PRIMARY KEY,
    bar int UNIQUE
);
Our goal is to perform an operation on that table that given some value $X$ for bar 1. in case $X$ inserts a row with bar set to $X$ and returns the newly inserted ID in case $X$ was not in the table 2. or that otherwise returns the ID of the row where bar carries the value $X$. This is a special case of a so called UPSERT[1] where the actual update should not change anything.

Returning the ID

PostgreSQL supports returning a newly inserted ID through the RETURNING construct[2]:
INSERT INTO foo (bar) VALUES(42) RETURNING id; -- returns newly inserted id (1 for example)
INSERT INTO foo (bar) VALUES(23) RETURNING id; -- returns newly inserted id (2 for example)
INSERT INTO foo (bar) VALUES(23) RETURNING id; -- this throws an error as expected

Handle Conflicts

We now want to formulate a query that either inserts 23 and returns the ID $2$ or that just returns the ID of 23 if 23 was already present. PostgreSQL supports this through the ON CONFLICT construct. Let's try
INSERT INTO foo (bar) VALUES(23) ON CONFLICT(bar) DO NOTHING RETURNING id;
This correctly returns the newly inserted ID in case $23$ wasn't in the table before but doesn't return anything if it was. Since RETURNING only returns stuff of rows that have been inserted or updated, we need to simulate an update in case the value already exists:
INSERT INTO foo (bar) VALUES(23) ON CONFLICT(bar) DO UPDATE SET bar=EXCLUDED.bar RETURNING id;
This correctly returns $2$. Please note that one needs to specify on which column the conflict happens now: For DO NOTHING this was optional but according to the docs DO UPDATE requires to specify this explicitly.

The Billion Dollars Mistake

So far so good. Now a well-known[3] detail of SQL is that unique constraints do not consider two different NULL values equal. So the following insert commands lead to two new rows in the table which have the same value in the bar column (at least for some definition of "the same"):
INSERT INTO foo (bar) VALUES(NULL);
INSERT INTO foo (bar) VALUES(NULL);

SELECT * FROM foo;
 id | bar
----+-----
  1 |  42
  2 |  23
  3 |  NULL
  4 |  NULL
In my use case there should at most be one row with a value of NULL. Sadly, there doesn't seem to be a satisfying solution for this. The best thing I could come up with requires that there is a special value in the range of bar that doesn't occur in the column at all. If bar models the age of a person for example -1 may be such a special value. If there is no such value in your range, you are screwed I think. If I am mistaken, please let me know in the comments below! So let's assume the value 0 is special for the bar column in that sense that bar never takes 0 as a value. Sadly, we cannot use the UNIQUE shortcut during table creation any more but need to define the index separately:
CREATE TABLE foo (
    id serial PRIMARY KEY,
    bar int CHECK(bar != 0) -- let's at least enforce that the special value does not appear
);
CREATE UNIQUE INDEX foo_unique_idx ON foo (COALESCE(bar, 0));
Executing the following insert
INSERT INTO foo (bar) VALUES(42) ON CONFLICT(bar) DO UPDATE SET bar=EXCLUDED.bar RETURNING id;
of course causes an error since there is no index on the column bar anymore but on something else (namely either the value of bar or $0$ if it is NULL). According to the documentation ON CONFLICT accepts a constraint name. Sadly technically, an index isn't a constraint. If you are using PostgrSQL version 9.5.3 or lower, you are screwed again now. Because in those versions ON CONFLICT doesn't accept arbitrary expression.

UPSERT at Last

If you can update to more recent version, the following inserts correctly returns three different ids:
INSERT INTO foo (bar) VALUES(42) ON CONFLICT(COALESCE(bar, 0)) DO UPDATE SET bar=EXCLUDED.bar RETURNING id;
INSERT INTO foo (bar) VALUES(23) ON CONFLICT(COALESCE(bar, 0)) DO UPDATE SET bar=EXCLUDED.bar RETURNING id;
INSERT INTO foo (bar) VALUES(23) ON CONFLICT(COALESCE(bar, 0)) DO UPDATE SET bar=EXCLUDED.bar RETURNING id;
INSERT INTO foo (bar) VALUES(NULL) ON CONFLICT(COALESCE(bar, 0)) DO UPDATE SET bar=EXCLUDED.bar RETURNING id;
INSERT INTO foo (bar) VALUES(NULL) ON CONFLICT(COALESCE(bar, 0)) DO UPDATE SET bar=EXCLUDED.bar RETURNING id;
That's it. Hope it helps someone out there.

Last Caveat

The way PostgreSQL handles upserts implemented with ON CONFLICT leads to the sequence corresponding to the ID column increasing even in the conflict (and update) case. So this technique may not be feasible in cases where successful inserts happen rarely but queries like above are executed rapidly.
  1. combination of "INSERT" and "UPDATE" [back]
  2. There are other ways to accomplish this. Namely curval and LASTVAL but they are not feasible in the general case where triggers may insert additional rows (in the same table). [back]
  3. not to me of course [back]

Tags: - -

Leave a Reply

Your email address will not be published. Required fields are marked *