More and more I’m seeing people use a NUMERIC for their primary keys. This is a direct result of people leveraging automated tools for their Oracle to PostgreSQL migration. Oracle’s NUMBER data type is frequently used as the primary key for a table, so the tools just map it to a PostgreSQL NUMERIC and call it a day. The PostgreSQL NUMERIC is a purpose-built data type for arbitrary-precision numbers where exactness is required, think monetary values. It can hold up to 131072 digits before the decimal point and up to 16383 after it. To me, that does not sound like the right choice for the surrogate key of your country code table. It really is more like using a cement truck for your daily commute: it gets the job done, but it is not even close to the most efficient way to do it.
Let’s take a look.
First, let’s create 2 tables with the only difference being the data type of the primary key.
test=# \d bar_int
              Table "public.bar_int"
 Column |       Type        | Collation | Nullable | Default
--------+-------------------+-----------+----------+---------
 a      | integer           |           | not null |
 b      | character varying |           |          |
Indexes:
    "bar_int_pkey" PRIMARY KEY, btree (a)

test=# \d bar_numeric
            Table "public.bar_numeric"
 Column |       Type        | Collation | Nullable | Default
--------+-------------------+-----------+----------+---------
 a      | numeric           |           | not null |
 b      | character varying |           |          |
Indexes:
    "bar_numeric_pkey" PRIMARY KEY, btree (a)
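For reference, the \d output above corresponds to DDL along these lines (a sketch; only the key column's type differs between the two tables):

```sql
-- Two tables identical except for the data type of the primary key.
CREATE TABLE bar_int (
    a integer PRIMARY KEY,  -- fixed-width, 4 bytes
    b varchar
);

CREATE TABLE bar_numeric (
    a numeric PRIMARY KEY,  -- arbitrary precision, variable width
    b varchar
);
```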
Then, let’s fill those tables with 100 million rows of data.
test=# INSERT INTO bar_int
test-# SELECT a, repeat('x', 100)
test-# FROM generate_series(1, 100000000) a;
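Only the integer table's load is shown above; bar_numeric gets the same rows, with generate_series output implicitly cast to numeric on the way into the key column:

```sql
-- Same 100 million rows as bar_int; the integer values from
-- generate_series are cast to numeric when stored in column a.
INSERT INTO bar_numeric
SELECT a, repeat('x', 100)
FROM generate_series(1, 100000000) a;
```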
Looking at the sizes on disk of those 2 tables, they both take up exactly the same amount of space. When people are trying to decide whether a NUMERIC is a safe choice for their primary keys, this is often where the analysis stops: the storage difference is irrelevant or non-existent, so it must be a good choice. We know it’s not, so let’s dig deeper.
test=# SELECT relname, pg_total_relation_size(relid)
test-# FROM pg_stat_user_tables WHERE relname LIKE 'bar%'
test-# ORDER BY 1;
   relname   | pg_total_relation_size
-------------+------------------------
 bar_int     |            16373833728
 bar_numeric |            16373833728
(2 rows)
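The identical sizes are less surprising than they look: a small NUMERIC value is only a few bytes larger than a 4-byte integer, and heap tuples are padded to 8-byte boundaries, so behind a 100-character filler column the difference disappears into alignment padding. One way to see this for yourself is pg_column_size():

```sql
-- Datum sizes of the two key representations...
SELECT pg_column_size(42::integer) AS int_bytes,
       pg_column_size(42::numeric) AS numeric_bytes;

-- ...and the sizes of whole rows fetched from each table.
SELECT pg_column_size(i.*) AS int_row_bytes,
       pg_column_size(n.*) AS numeric_row_bytes
FROM bar_int i, bar_numeric n
WHERE i.a = 1 AND n.a = 1;
```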
In a transactional system, the primary key is used to return a single row so let’s look at the difference NUMERIC makes for this use case. Creating a simple script to look up a random row by the primary key and running it via pgbench shows the performance loss of NUMERIC.
\set id random(1, 10000000)
SELECT b FROM bar_int WHERE a = :id;

pgbench -n -f number_test -c 4 -t 100000 test
Integer: 27105 tps
Numeric: 25823 tps
That is a 4.7% performance loss from using NUMERIC instead of an integer. For some people who want to squeeze every last bit of performance out of the system, this is enough to rule out NUMERIC, but for many it is not. The convenience of using a NUMERIC may well be worth a roughly 5% loss on queries that run in milliseconds. But let’s keep digging.
A very common first migration project people undertake is a reporting database. In many ways it is a perfect choice for getting comfortable with PostgreSQL, since the uptime requirements are frequently not as stringent as for a transactional system. Let’s look at a very basic reporting-type query joining 2 tables by their primary keys (foo_int and foo_numeric are a second pair of tables created and populated the same way as the bar tables).
SELECT count(*) FROM foo_int a, bar_int b WHERE a.a = b.a;
This is where the performance loss of NUMERIC will change everyone’s mind about using them as primary keys.
Average Query Time:
Integer: 5637 ms
Numeric: 9216 ms
Switching the primary key from NUMERIC to integer cuts the query’s run time by nearly 39%, from 9216 ms down to 5637 ms. And that is just a simple case. If you start summing or averaging the values, which is frequently done in reports, the performance loss easily tops 40%.
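The aggregate case can be exercised with queries like the following (illustrative, not from the original benchmark):

```sql
-- Aggregating the key column: sum() and avg() over a numeric column
-- perform arbitrary-precision arithmetic for every row, while the
-- integer versions accumulate in native 64-bit arithmetic.
SELECT sum(a), avg(a) FROM bar_int;
SELECT sum(a), avg(a) FROM bar_numeric;
```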
While it is easy to just map an Oracle NUMBER to a PostgreSQL NUMERIC, take the time to map each column to the correct data type and save yourself that 40% performance hit.