When you want to copy all the records from one table to another – for example, to change the data model, or (my favorite) to migrate data from a foreign table into a local PostgreSQL table – you want to use multiple batches to avoid one long-running transaction and to avoid restarting from the beginning if something fails. This is a common challenge when copying data from one server to another (for example, PROD to dev) or migrating data from another database (like Oracle or MySQL) to Postgres via a Foreign Data Wrapper (FDW).

Common Table Expressions (CTEs) and a small amount of PL/pgSQL make it easy to code up a loop that grabs a batch of records, notes the maximum id, and starts the next batch right after it.

Set up some sample data

Let’s create a couple of sample tables and fill the source with a few thousand sample records:

CREATE TABLE my_source (id int, field1 int, field2 varchar(50));
CREATE TABLE my_dest (id int, field1 int, field2 varchar(50));
INSERT INTO my_source (SELECT generate_series(1, 4000, 1), generate_series(2,8000,2), 'some made up text');
select * from my_source limit 5;
 id | field1 |      field2       
----+--------+-------------------
  1 |      2 | some made up text
  2 |      4 | some made up text
  3 |      6 | some made up text
  4 |      8 | some made up text
  5 |     10 | some made up text
(5 rows)

Simple SQL Statement

The naive approach to copying from source to dest is to use a single INSERT INTO … SELECT statement, like this:

INSERT INTO my_dest SELECT id, field1, field2 from my_source; 
INSERT 0 4000 

That works fine on this small sample data, but not for the 100-million-row Oracle table you are trying to copy over the network with a few data transformations in the middle. It is better to copy the records over in batches so that you can commit every few thousand or million records, depending on the size of your server and tables.

Better with Batches

Same sample tables. Start fresh for the example:

TRUNCATE TABLE my_dest;
TRUNCATE TABLE

Now, use a CTE that inserts a batch of 100 records and returns the max(id) to use as the starting point for the next batch. As a simple example, wrap it in a DO block to loop:

DO LANGUAGE plpgsql $$
DECLARE max_id INTEGER := 0;
BEGIN
    -- When a batch comes back empty, max(id) is NULL; NULL > -1 is not
    -- true, so the loop exits.
    WHILE max_id > -1 LOOP
        WITH insertrows AS (
          INSERT INTO my_dest
          SELECT * FROM my_source
          WHERE id > max_id
          ORDER BY id
          LIMIT 100
          RETURNING id
        )
        SELECT max(id) INTO max_id FROM insertrows;
        RAISE NOTICE '%', max_id;
    END LOOP;
END$$;
NOTICE:  100
NOTICE:  200
NOTICE:  300
...
NOTICE:  3900
NOTICE:  4000
NOTICE:  <NULL>
DO

Notes and Caveats

DON’T “DO” THIS. The DO block above is just an example – it still executes as a single transaction, so you lose the restart capability. Use an external program to loop and issue each batch as a separate statement. (Or use autonomous background transactions, but that’s another post.)
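
As a rough sketch, the external driver can be nothing more than plain psql: run the batch statement over and over with autocommit on, and let \gset feed the returned maximum back into a psql variable. The last_id variable name is made up, and the outer loop that re-runs the statement and stops once the variable comes back unset is left to a small shell script or scheduler.

\set last_id 0

-- Each run is its own transaction under psql's default autocommit.
-- \gset stores the returned max(id) into :last_id; when the batch is
-- empty, max(id) is NULL, the variable is left unset, and the wrapper
-- should stop.
WITH insertrows AS (
    INSERT INTO my_dest
    SELECT * FROM my_source
    WHERE id > :last_id
    ORDER BY id
    LIMIT 100
    RETURNING id
)
SELECT max(id) AS last_id FROM insertrows
\gset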

If the external process fails partway through, you will need to query the max(id) from the destination table to figure out where to restart. If you expect failures (a flaky remote network link, etc.), or if you want to show progress during a long run, you may want to store the max_id value in a table or write it to a log file.
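
Here is a minimal sketch of that bookkeeping, assuming a made-up copy_progress table and job name, and reusing the :last_id variable from the sketch above. The progress row is updated as part of the same batch statement, and the driver can read it back at startup instead of keeping state itself.

CREATE TABLE copy_progress (job text PRIMARY KEY, last_id int, updated_at timestamptz);

-- Fold the progress update into the batch; the data-modifying CTE still
-- runs even when HAVING skips the empty final batch.
WITH insertrows AS (
    INSERT INTO my_dest
    SELECT * FROM my_source
    WHERE id > :last_id
    ORDER BY id
    LIMIT 100
    RETURNING id
)
INSERT INTO copy_progress (job, last_id, updated_at)
SELECT 'my_source_copy', max(id), now()
FROM insertrows
HAVING max(id) IS NOT NULL
ON CONFLICT (job) DO UPDATE
    SET last_id = EXCLUDED.last_id, updated_at = EXCLUDED.updated_at;

-- After a failure, find out where to restart:
SELECT last_id FROM copy_progress WHERE job = 'my_source_copy';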

Getting the max_id and filtering the next batch on it is much better than using a fixed number of rows and OFFSET to skip to the next batch. OFFSET actually requires the database to retrieve and count the rows it is skipping rather than just “jumping” ahead.
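
For contrast, this is the OFFSET anti-pattern being avoided: each batch has to re-read everything it skips, so later batches get slower and slower.

INSERT INTO my_dest
SELECT * FROM my_source
ORDER BY id
LIMIT 100 OFFSET 3900;  -- batch 40: the first 3900 rows are fetched and counted just to be thrown away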

The batching works best with a unique primary key as the field to filter on, since you know there are no duplicates and there is already an index available. If you are batching on some other field, make sure there is an index on it; otherwise the filter will do a sequential scan on the source and be slower than you want.
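
If you do have to batch on something other than the primary key, adding an index on that column is the first step (batch_key here is just a made-up column name):

-- Hypothetical: batching on a non-key column. Index it so the
-- WHERE batch_key > ... filter can use an index scan instead of a
-- sequential scan of my_source.
CREATE INDEX my_source_batch_key_idx ON my_source (batch_key);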

Don’t leave that RAISE NOTICE in your production code 🙂

Happy Batching!