PostgreSQL by BigSQL is committed to making Postgres user and developer friendly. As part of this effort, we continue to extend the list of extensions we ship and support. Recently, we started supporting pg_bulkload, a high speed data loading utility for PostgreSQL.
In this blog post, we will illustrate how to quickly install and use the pg_bulkload extension. Installation is pretty straightforward. First, list the available extensions:
$ ./pgc list
And install the relevant bulkload package:
$ ./pgc install bulkload3-pg96
After installation, execute the environment script to set the environment variables:
$ . ./pg96/pg96.env
Setting the environment variables and .pgpass is important. Because pg_bulkload uses this nformation for connecting to the database server.
Now we are ready to create an extension:
postgres=# create extension pg_bulkload; CREATE EXTENSION
Next, create a target table to which we can load the data:
CREATE TABLE c_review ( customer_id TEXT, review_date DATE, review_rating INTEGER, review_votes INTEGER, review_helpful_votes INTEGER, product_id CHAR(10), product_title TEXT, product_sales_rank BIGINT, product_group TEXT, product_category TEXT, product_subcategory TEXT, similar_product_ids CHAR(10) );
Create a control file. Pg_bulkload uses this file as the spec file for data loading:
Add following lines to the file:
OUTPUT = c_review # [<schema_name>.]table_name INPUT = /home/vagrant/bigsql/customer_reviews_1998.csv # Input data location (absolute path) TYPE = CSV # Input file type QUOTE = "\"" # Quoting character ESCAPE = \ # Escape character for Quoting DELIMITER = "," # Delimiter
Now we can execute the pg_bulkload by providing the control file in parameter.
$ pg_bulkload sample_csv.ctl NOTICE: BULK LOAD START NOTICE: BULK LOAD END 0 Rows skipped. 589859 Rows successfully loaded. 0 Rows not loaded due to parse errors. 0 Rows not loaded due to duplicate errors. 0 Rows replaced with new rows.
Once data is loaded, we can verify it again:
postgres=# select count(*) from c_review; count 589859 (1 row)
Pg_bulkload also supports parallel loading – you can enable this by adding a line to sample_csv.ctl:
“WRITER = PARALLEL”
In my next post I’ll show the performance improvement using the parallel option.