The PostgreSQL storage manager is tasked with the complex job of satisfying ACID compliance. This is a well-documented series of algorithms that I won’t go into here. As in any storage system or database, PostgreSQL tables can start to take on free space (sometimes known as bloat). While free space is not inherently a bad thing, there are situations where it can get unwieldy.

At a high-level:

  • During an UPDATE or DELETE in PostgreSQL, the old row version is not removed immediately. It is marked as dead and its space eventually becomes ‘free space’ available for re-use.
  • If no incoming rows fit into that free space (or you have no new rows at all), the ‘dead rows’ that were marked as free simply keep taking up disk space (see the sketch just below).
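
To see this in action, here is a minimal sketch. The bloat_demo table, its row count and payload are invented purely for illustration:

-- create a throwaway table and fill it with 100,000 rows
CREATE TABLE bloat_demo (id int, payload text);
INSERT INTO bloat_demo SELECT i, repeat('x', 100) FROM generate_series(1, 100000) AS i;

-- every UPDATE leaves the old row version behind as a dead row
UPDATE bloat_demo SET payload = repeat('y', 100);

-- the dead rows show up in the statistics (the counters are updated
-- asynchronously, so they may lag a moment behind the UPDATE)
SELECT n_live_tup, n_dead_tup, pg_size_pretty(pg_relation_size('bloat_demo'))
FROM pg_stat_user_tables
WHERE relname = 'bloat_demo';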

Anecdotally, I’ve seen a table using 100GB of disk space, with only a few thousand rows totaling less than 1 GB. This was a queue used by an old version of the quartz scheduler. The table was high volume, pure INSERT & DELETE. There was never any opportunity for dead space to be found and re-used.

So, we’ve identified high-volume UPDATE & DELETE traffic patterns that can cause bloat. But how do you actually monitor for it?

There are really two schools of thought here:

  1. Deep-scan the table, reading all live & dead rows
  2. Catalog-based estimates

Is bloat really bad?

Before moving on to finding bloat, let’s understand the implications. If you’ve spent a lot of time with DBAs, you’ve probably heard them discussing free space. Free space is disk space that has already been allocated and is available for re-use. Re-using this space is much more efficient than allocating new blocks. Our goal isn’t to eliminate all bloat; the goal is to eliminate excessive bloat. The queries here will either definitively find or estimate the amount of available space in a relation. I don’t recommend going on a quixotic quest against all of your built-up free space. Just eliminate the extra bloat.

Deep scan

The first strategy is going to be the most accurate method of determining bloat. You’ll be able to see exactly how much bloat is in a table. There are a few issues with this strategy:

  • Deep scan of relation takes time
  • Deep scan demands I/O from server

In other words: It’s going to take a long time and most likely impact your system performance.

I’m not going to deep-dive on this option today; if you’re interested, have a look at the pgstattuple module. (NB: pgstattuple also has a pgstattuple_approx function in the latest versions. Although it approximates the bloat, making the deep scan somewhat faster, it still scans the relation.)
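
If you do want to go this route, a minimal sketch looks like the following (pgstattuple ships with the standard contrib packages; pgbench_accounts is just an example table):

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- exact numbers: reads every block of the relation
SELECT * FROM pgstattuple('public.pgbench_accounts');

-- approximate numbers (recent versions): uses the visibility map to skip
-- all-visible blocks, but still reads much of the relation
SELECT * FROM pgstattuple_approx('public.pgbench_accounts');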

Catalogs

Postgres gathers statistics about tables and indexes in order to query them efficiently (this is done via ANALYZE, a discussion for another day). A few well-placed queries can use these statistics to estimate the amount of bloat in a table. Since we are using the catalogs, no deep scan is required. The downside is that we won’t have 100% accurate numbers; this is only an estimate.
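
Because these estimates lean on the statistics that ANALYZE collects, it’s worth confirming those statistics are reasonably fresh before trusting the numbers. A quick sketch:

-- when was each table last analyzed (manually or by autovacuum)?
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY GREATEST(last_analyze, last_autoanalyze) NULLS FIRST;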

I prefer a strategy that allows me to quickly generate a list of potential problems before digging in further. Essentially, I start with an estimate, then drill in looking for more detail.

There are two very popular bloat-estimation queries available to us. The most well-known was developed as part of check_postgres.pl (a PostgreSQL plugin for Nagios). Personally, I find its results a bit harder to parse manually.

My current favorite query gives a nice, human-readable overview of potential bloat issues (from: https://github.com/ioguix/pgsql-bloat-estimation):

/* WARNING: executed with a non-superuser role, the query inspects only tables you are granted to read.
* This query is compatible with PostgreSQL 9.0 and more
*/
SELECT current_database(), schemaname, tblname, bs*tblpages AS real_size,
 (tblpages-est_tblpages)*bs AS extra_size,
 CASE WHEN tblpages - est_tblpages > 0
 THEN 100 * (tblpages - est_tblpages)/tblpages::float
 ELSE 0
 END AS extra_ratio, fillfactor, (tblpages-est_tblpages_ff)*bs AS bloat_size,
 CASE WHEN tblpages - est_tblpages_ff > 0
 THEN 100 * (tblpages - est_tblpages_ff)/tblpages::float
 ELSE 0
 END AS bloat_ratio, is_na
 -- , (pst).free_percent + (pst).dead_tuple_percent AS real_frag
FROM (
 SELECT ceil( reltuples / ( (bs-page_hdr)/tpl_size ) ) + ceil( toasttuples / 4 ) AS est_tblpages,
 ceil( reltuples / ( (bs-page_hdr)*fillfactor/(tpl_size*100) ) ) + ceil( toasttuples / 4 ) AS est_tblpages_ff,
 tblpages, fillfactor, bs, tblid, schemaname, tblname, heappages, toastpages, is_na
 -- , stattuple.pgstattuple(tblid) AS pst
 FROM (
 SELECT
 ( 4 + tpl_hdr_size + tpl_data_size + (2*ma)
 - CASE WHEN tpl_hdr_size%ma = 0 THEN ma ELSE tpl_hdr_size%ma END
 - CASE WHEN ceil(tpl_data_size)::int%ma = 0 THEN ma ELSE ceil(tpl_data_size)::int%ma END
 ) AS tpl_size, bs - page_hdr AS size_per_block, (heappages + toastpages) AS tblpages, heappages,
 toastpages, reltuples, toasttuples, bs, page_hdr, tblid, schemaname, tblname, fillfactor, is_na
 FROM (
 SELECT
 tbl.oid AS tblid, ns.nspname AS schemaname, tbl.relname AS tblname, tbl.reltuples,
 tbl.relpages AS heappages, coalesce(toast.relpages, 0) AS toastpages,
 coalesce(toast.reltuples, 0) AS toasttuples,
 coalesce(substring(
 array_to_string(tbl.reloptions, ' ')
 FROM '%fillfactor=#"__#"%' FOR '#')::smallint, 100) AS fillfactor,
 current_setting('block_size')::numeric AS bs,
 CASE WHEN version()~'mingw32' OR version()~'64-bit|x86_64|ppc64|ia64|amd64' THEN 8 ELSE 4 END AS ma,
 24 AS page_hdr,
 23 + CASE WHEN MAX(coalesce(null_frac,0)) > 0 THEN ( 7 + count(*) ) / 8 ELSE 0::int END
 + CASE WHEN tbl.relhasoids THEN 4 ELSE 0 END AS tpl_hdr_size,
 sum( (1-coalesce(s.null_frac, 0)) * coalesce(s.avg_width, 1024) ) AS tpl_data_size,
 bool_or(att.atttypid = 'pg_catalog.name'::regtype) AS is_na
 FROM pg_attribute AS att
 JOIN pg_class AS tbl ON att.attrelid = tbl.oid
 JOIN pg_namespace AS ns ON ns.oid = tbl.relnamespace
 JOIN pg_stats AS s ON s.schemaname=ns.nspname
 AND s.tablename = tbl.relname AND s.inherited=false AND s.attname=att.attname
 LEFT JOIN pg_class AS toast ON tbl.reltoastrelid = toast.oid
 WHERE att.attnum > 0 AND NOT att.attisdropped
 AND tbl.relkind = 'r'
 GROUP BY 1,2,3,4,5,6,7,8,9,10, tbl.relhasoids
 ORDER BY 2,3
 ) AS s
 ) AS s2
) AS s3;
-- WHERE NOT is_na
-- AND tblpages*((pst).free_percent + (pst).dead_tuple_percent)::float4/100 >= 1

 

As I said, the complexity of these queries is high. Let’s take a look at the output:


 current_database | schemaname |        tblname         | real_size  | extra_size |    extra_ratio    | fillfactor |  bloat_size  |    bloat_ratio    | is_na
------------------+------------+------------------------+------------+------------+-------------------+------------+--------------+-------------------+-------
 postgres         | public     | pgbench_accounts       | 1411891200 |   39084032 |  2.76820423556716 |         10 | -12316139520 |                 0 | f
 postgres         | public     | pgbench_branches       |    1105920 |    1097728 |  99.2592592592593 |         10 |      1064960 |  96.2962962962963 | f
 postgres         | public     | pgbench_history        |   16867328 |      73728 | 0.437105390966489 |        100 |        73728 | 0.437105390966489 | f
 postgres         | public     | pgbench_tellers        |   21422080 |   21372928 |  99.7705544933078 |         10 |     20979712 |  97.9349904397706 | f
 postgres         | public     | t_stats                |      32768 |          0 |                 0 |        100 |            0 |                 0 | f
 postgres         | snapshots  | snap                   |      65536 |          0 |                 0 |        100 |            0 |                 0 | f
 postgres         | snapshots  | snap_databases         |    2424832 |     327680 |  13.5135135135135 |        100 |       327680 |  13.5135135135135 | t
 postgres         | snapshots  | snap_indexes           |    9330688 |     327680 |  3.51185250219491 |        100 |       327680 |  3.51185250219491 | t
 postgres         | snapshots  | snap_pg_locks          |    5980160 |     483328 |  8.08219178082192 |        100 |       483328 |  8.08219178082192 | f
 postgres         | snapshots  | snap_settings          |      24576 |          0 |                 0 |        100 |            0 |                 0 | f
 postgres         | snapshots  | snap_stat_activity     |    1449984 |      65536 |  4.51977401129944 |        100 |        65536 |  4.51977401129944 | t
 postgres         | snapshots  | snap_statio_all_tables |   29868032 |     974848 |  3.26385079539221 |        100 |       974848 |  3.26385079539221 | t
 postgres         | snapshots  | snap_user_tables       |    5472256 |     270336 |  4.94011976047904 |        100 |       270336 |  4.94011976047904 | t

What we get here is a very nice overview that shows us (all sizes in bytes) the real_size of the relation, how much of it is extra, and what the extra ratio is. The last column, is_na (is not applicable), is important. Essentially, this column is true if you’re using the ‘name’ datatype. The ‘name’ datatype throws off the bloat estimation and can give you invalid data.

Since these are estimates, we have to take them with a grain of salt. If you’re not already familiar with the workload, you’ll probably want to compare them against the usage rates in pg_stat_user_tables.
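
For example, something along these lines gives a rough per-table activity profile (all of these columns live in pg_stat_user_tables):

-- heavy UPDATE/DELETE traffic plus a large dead-row count is a good hint
-- that an estimated bloat figure is real
SELECT relname,
       n_tup_ins, n_tup_upd, n_tup_del,
       n_live_tup, n_dead_tup,
       last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;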

Reclaiming space is a topic for another day, but take a look at VACUUM FULL (which requires a full table lock for compaction) or pg_repack (an extension that reclaims space with minimal locking).
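
As a rough sketch of the two options (pgbench_tellers is just an example table borrowed from the output above):

-- VACUUM FULL rewrites the table and returns the space to the operating
-- system, but holds an ACCESS EXCLUSIVE lock for the entire rewrite
VACUUM FULL public.pgbench_tellers;

-- pg_repack is a separate extension plus command-line tool; run from the
-- shell, it rebuilds the table online with only brief locks, e.g.:
--   pg_repack --dbname=postgres --table=public.pgbench_tellers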

Now you have some information about your bloat. At this point, you can drill in with pgstattuple, or begin targeting suspects with more aggressive vacuums or compaction.
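
For the more aggressive vacuum route, one option is to lower the autovacuum thresholds on a known offender; the table name and values below are purely illustrative:

-- make autovacuum kick in after roughly 1% of the table has changed
ALTER TABLE public.pgbench_tellers
  SET (autovacuum_vacuum_scale_factor = 0.01,
       autovacuum_vacuum_threshold = 1000);

-- or run a manual vacuum right away: it won't shrink the file,
-- but it makes the dead space available for re-use
VACUUM (VERBOSE, ANALYZE) public.pgbench_tellers;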

The other (previously mentioned) query is from the popular check_postgres.pl script for Nagios:

SELECT
  current_database() AS db, schemaname, tablename, reltuples::bigint AS tups, relpages::bigint AS pages, otta,
  ROUND(CASE WHEN otta=0 OR sml.relpages=0 OR sml.relpages=otta THEN 0.0 ELSE sml.relpages/otta::numeric END,1) AS tbloat,
  CASE WHEN relpages < otta THEN 0 ELSE relpages::bigint - otta END AS wastedpages,
  CASE WHEN relpages < otta THEN 0 ELSE bs*(sml.relpages-otta)::bigint END AS wastedbytes,
  CASE WHEN relpages < otta THEN '0 bytes'::text ELSE (bs*(relpages-otta))::bigint || ' bytes' END AS wastedsize,
  iname, ituples::bigint AS itups, ipages::bigint AS ipages, iotta,
  ROUND(CASE WHEN iotta=0 OR ipages=0 OR ipages=iotta THEN 0.0 ELSE ipages/iotta::numeric END,1) AS ibloat,
  CASE WHEN ipages < iotta THEN 0 ELSE ipages::bigint - iotta END AS wastedipages,
  CASE WHEN ipages < iotta THEN 0 ELSE bs*(ipages-iotta) END AS wastedibytes,
  CASE WHEN ipages < iotta THEN '0 bytes' ELSE (bs*(ipages-iotta))::bigint || ' bytes' END AS wastedisize,
  CASE WHEN relpages < otta THEN
    CASE WHEN ipages < iotta THEN 0 ELSE bs*(ipages-iotta::bigint) END
  ELSE
    CASE WHEN ipages < iotta THEN bs*(relpages-otta::bigint)
    ELSE bs*(relpages-otta::bigint + ipages-iotta::bigint) END
  END AS totalwastedbytes
FROM (
  SELECT
    nn.nspname AS schemaname, cc.relname AS tablename,
    COALESCE(cc.reltuples, 0) AS reltuples, COALESCE(cc.relpages, 0) AS relpages, COALESCE(bs, 0) AS bs,
    COALESCE(CEIL((cc.reltuples*((datahdr+ma-
      (CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::float)),0) AS otta,
    COALESCE(c2.relname,'?') AS iname, COALESCE(c2.reltuples,0) AS ituples, COALESCE(c2.relpages,0) AS ipages,
    COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::float)),0) AS iotta -- very rough approximation, assumes ALL cols
  FROM pg_class cc
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname <> 'information_schema'
  LEFT JOIN (
    SELECT
      ma, bs, foo.nspname, foo.relname,
      (datawidth+(hdr+ma-(case WHEN hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
      (maxfracsum*(nullhdr+ma-(case WHEN nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
    FROM (
      SELECT
        ns.nspname, tbl.relname, hdr, ma, bs,
        SUM((1-coalesce(null_frac, 0))*coalesce(avg_width, 2048)) AS datawidth,
        MAX(coalesce(null_frac, 0)) AS maxfracsum,
        hdr + (
          SELECT 1+count(*)/8
          FROM pg_stats s2
          WHERE null_frac<>0 AND s2.schemaname = ns.nspname AND s2.tablename = tbl.relname
        ) AS nullhdr
      FROM pg_attribute att
      JOIN pg_class tbl ON att.attrelid = tbl.oid
      JOIN pg_namespace ns ON ns.oid = tbl.relnamespace
      LEFT JOIN pg_stats s ON s.schemaname=ns.nspname
        AND s.tablename = tbl.relname AND s.inherited=false AND s.attname=att.attname,
      (
        SELECT
          (SELECT current_setting('block_size')::numeric) AS bs,
          CASE WHEN SUBSTRING(SPLIT_PART(v, ' ', 2) FROM '#"[0-9]+.[0-9]+#"%' for '#')
            IN ('8.0','8.1','8.2') THEN 27 ELSE 23 END AS hdr,
          CASE WHEN v ~ 'mingw32' OR v ~ '64-bit' THEN 8 ELSE 4 END AS ma
        FROM (SELECT version() AS v) AS foo
      ) AS constants
      WHERE att.attnum > 0 AND tbl.relkind='r'
      GROUP BY 1,2,3,4,5
    ) AS foo
  ) AS rs ON cc.relname = rs.relname AND nn.nspname = rs.nspname
  LEFT JOIN pg_index i ON indrelid = cc.oid
  LEFT JOIN pg_class c2 ON c2.oid = i.indexrelid
) AS sml;

Happy PostgreSQL-ing!